// SPDX-License-Identifier: GPL-2.0
/*
 *  linux/fs/ext4/super.c
 *
 * Copyright (C) 1992, 1993, 1994, 1995
 * Remy Card (card@masi.ibp.fr)
 * Laboratoire MASI - Institut Blaise Pascal
 * Universite Pierre et Marie Curie (Paris VI)
 *
 *  from
 *
 *  linux/fs/minix/inode.c
 *
 *  Copyright (C) 1991, 1992  Linus Torvalds
 *
 *  Big-endian to little-endian byte-swapping/bitmaps by
 *        David S. Miller (davem@caip.rutgers.edu), 1995
 */
#include <linux/module.h>
#include <linux/string.h>
#include <linux/fs.h>
#include <linux/time.h>
#include <linux/vmalloc.h>
#include <linux/slab.h>
#include <linux/init.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/parser.h>
#include <linux/buffer_head.h>
#include <linux/exportfs.h>
#include <linux/vfs.h>
#include <linux/random.h>
#include <linux/mount.h>
#include <linux/namei.h>
#include <linux/quotaops.h>
#include <linux/seq_file.h>
#include <linux/ctype.h>
#include <linux/log2.h>
Ext4: Uninitialized Block Groups
In pass1 of e2fsck, every inode table in the filesystem is scanned and checked,
regardless of whether it is in use. This is the most time-consuming part
of the filesystem check. The uninitialized block group feature can greatly
reduce e2fsck time by eliminating checking of uninitialized inodes.
With this feature, there is a high water mark of used inodes for each block
group. Block and inode bitmaps can be uninitialized on disk via a flag in the
group descriptor to avoid reading or scanning them at e2fsck time. A checksum
of each group descriptor is used to ensure that corruption in the group
descriptor's bit flags does not cause incorrect operation.
The feature is enabled through a mkfs option:
mke2fs /dev/ -O uninit_groups
A patch adding support for uninitialized block groups to e2fsprogs tools has
been posted to the linux-ext4 mailing list.
The patches have been stress tested with fsstress and fsx. In performance
tests of e2fsck time, we have seen that e2fsck time on ext3 grows
linearly with the total number of inodes in the filesystem. In ext4 with the
uninitialized block groups feature, the e2fsck time depends
solely on the number of used inodes rather than the total inode count.
Since typical ext4 filesystems only use 1-10% of their inodes, this feature can
greatly reduce e2fsck time for users, with a performance improvement of 2-20
times, depending on how full the filesystem is.
The attached graph shows the major improvements in e2fsck times in filesystems
with a large total inode count, but few inodes in use.
In each group descriptor, if we have
EXT4_BG_INODE_UNINIT set in bg_flags:
	The inode table is not initialized/used in this group, so we can skip
	the consistency check during fsck.
EXT4_BG_BLOCK_UNINIT set in bg_flags:
	No block in the group is used, so we can skip the block bitmap
	verification for this group.
We also add two new fields to the group descriptor as part of the
uninitialized group patch:
	__le16 bg_itable_unused; /* Unused inodes count */
	__le16 bg_checksum; /* crc16(sb_uuid+group+desc) */
bg_itable_unused:
	If EXT4_BG_INODE_UNINIT is not set in bg_flags,
	then bg_itable_unused marks how far into the inode table inodes
	are in use: the last bg_itable_unused inodes of the table are
	unused. fsck can use this to skip the inodes that are marked unused.
bg_checksum:
	Now that we depend on bg_flags and bg_itable_unused to determine
	the block and inode usage, we need to make sure the group descriptor
	is not corrupt. We add a checksum to the group descriptor to
	detect corruption. If the descriptor is found to be corrupt, we
	mark all the blocks and inodes in the group as used.
Signed-off-by: Avantika Mathur <mathur@us.ibm.com>
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
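The rules in the commit message above can be sketched as a small user-space function. This is only an illustration, not the e2fsck or kernel code: struct group_info, its checksum_ok field and inodes_to_scan() are hypothetical names, and the two flag values are assumed to match the ext4 definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values assumed to match the ext4 bg_flags definitions. */
#define EXT4_BG_INODE_UNINIT 0x0001 /* inode table/bitmap not in use */
#define EXT4_BG_BLOCK_UNINIT 0x0002 /* block bitmap not in use */

/* Hypothetical, simplified view of one group descriptor. */
struct group_info {
	uint16_t bg_flags;
	uint16_t bg_itable_unused; /* unused inodes at the end of the table */
	bool checksum_ok;          /* result of verifying bg_checksum */
};

/* Hypothetical helper: how many inodes a checker must scan in this group. */
static uint32_t inodes_to_scan(const struct group_info *g,
			       uint32_t inodes_per_group)
{
	assert(g != NULL);
	/* A corrupt descriptor cannot be trusted: treat every inode as used. */
	if (!g->checksum_ok)
		return inodes_per_group;
	/* The whole inode table is unused, so nothing needs scanning. */
	if (g->bg_flags & EXT4_BG_INODE_UNINIT)
		return 0;
	/* Guard against a nonsensical count before subtracting. */
	if (g->bg_itable_unused > inodes_per_group)
		return inodes_per_group;
	return inodes_per_group - g->bg_itable_unused;
}
```

With 8192 inodes per group and bg_itable_unused equal to 8000, such a checker would look at only 192 inodes, which is where the reduction in e2fsck time comes from.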
#include <linux/crc16.h>
#include <linux/dax.h>
#include <linux/uaccess.h>
#include <linux/iversion.h>
#include <linux/unicode.h>
#include <linux/part_stat.h>
ext4: add support for lazy inode table initialization
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out. The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, have not been corrupted. However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the old, uninitialized data could potentially cause e2fsck to
report false problems.
Hence, it is important for the inode tables to be initialized as soon
as possible. This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.
This is done via a new kernel thread called ext4lazyinit, which is
created on demand and destroyed when it is no longer needed. There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with the init_itable mount option is mounted, the
ext4lazyinit thread is created, and the filesystem can then register its
request in the request list.
This thread then walks through the list of requests, picking up
scheduled requests and invoking ext4_init_inode_table(). The next schedule
time for a request is computed by multiplying the time it took to
zero out the last inode table by a wait multiplier, which can be set with
the (init_itable=n) mount option (default is 10). We do
this so that we do not take the whole I/O bandwidth. When the thread is no
longer necessary (the request list is empty), it frees the appropriate
structures and exits (and can be created again later by another
filesystem).
We do not disturb regular inode allocations in any way; the allocator does
not care whether the inode table is zeroed or not. But when zeroing, we
have to skip used inodes. We should also prevent new inode
allocations from the group while zeroing is under way. For that we
take the alloc_sem lock for writing in ext4_init_inode_table() and for
reading in ext4_claim_inode(), so when we are unlucky and the allocator hits
the group which is currently being zeroed, it just has to wait.
This can be suppressed using the mount option noinit_itable.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
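The throttling rule described above (wait for the time the last zeroing took, multiplied by the wait multiplier) can be sketched as follows. This is a simplified illustration rather than the kernel implementation: next_schedule_time() and DEFAULT_WAIT_MULT are hypothetical names, and times are plain integers in an arbitrary unit.

```c
#include <assert.h>
#include <stdint.h>

#define DEFAULT_WAIT_MULT 10 /* default multiplier named in the text */

/*
 * Hypothetical helper: given the current time and how long zeroing the
 * last inode table took (same unit), return when the next request should run.
 */
static uint64_t next_schedule_time(uint64_t now, uint64_t zeroing_time,
				   unsigned int wait_mult)
{
	assert(wait_mult > 0);
	/* Slow devices take longer to zero, so they also get longer pauses. */
	return now + zeroing_time * wait_mult;
}
```

Under this model and the default multiplier, the thread spends about one part zeroing for every ten parts waiting, so the background work adapts to device speed instead of consuming a fixed share of I/O bandwidth.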
#include <linux/kthread.h>
#include <linux/freezer.h>
#include <linux/fsnotify.h>
#include <linux/fs_context.h>
#include <linux/fs_parser.h>
#include "ext4.h"
#include "ext4_extents.h"	/* Needed for trace points definition */
#include "ext4_jbd2.h"
#include "xattr.h"
#include "acl.h"
#include "mballoc.h"
#include "fsmap.h"

#define CREATE_TRACE_POINTS
#include <trace/events/ext4.h>

static struct ext4_lazy_init *ext4_li_info;
static DEFINE_MUTEX(ext4_li_mtx);
static struct ratelimit_state ext4_mount_msg_ratelimit;

static int ext4_load_journal(struct super_block *, struct ext4_super_block *,
			     unsigned long journal_devnum);
static int ext4_show_options(struct seq_file *seq, struct dentry *root);
static void ext4_update_super(struct super_block *sb);
static int ext4_commit_super(struct super_block *sb);
static int ext4_mark_recovery_complete(struct super_block *sb,
					struct ext4_super_block *es);
static int ext4_clear_journal_err(struct super_block *sb,
				  struct ext4_super_block *es);
static int ext4_sync_fs(struct super_block *sb, int wait);
static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf);
static int ext4_unfreeze(struct super_block *sb);
static int ext4_freeze(struct super_block *sb);
static inline int ext2_feature_set_ok(struct super_block *sb);
static inline int ext3_feature_set_ok(struct super_block *sb);
static void ext4_destroy_lazyinit_thread(void);
static void ext4_unregister_li_request(struct super_block *sb);
static void ext4_clear_request_list(void);
static struct inode *ext4_get_journal_inode(struct super_block *sb,
					    unsigned int journal_inum);
static int ext4_validate_options(struct fs_context *fc);
static int ext4_check_opt_consistency(struct fs_context *fc,
				      struct super_block *sb);
static void ext4_apply_options(struct fs_context *fc, struct super_block *sb);
static int ext4_parse_param(struct fs_context *fc, struct fs_parameter *param);
static int ext4_get_tree(struct fs_context *fc);
static int ext4_reconfigure(struct fs_context *fc);
static void ext4_fc_free(struct fs_context *fc);
static int ext4_init_fs_context(struct fs_context *fc);
static const struct fs_parameter_spec ext4_param_specs[];

/*
 * Lock ordering
 *
 * page fault path:
 * mmap_lock -> sb_start_pagefault -> invalidate_lock (r) -> transaction start
 *   -> page lock -> i_data_sem (rw)
 *
 * buffered write path:
 * sb_start_write -> i_mutex -> mmap_lock
 * sb_start_write -> i_mutex -> transaction start -> page lock ->
 *   i_data_sem (rw)
 *
 * truncate:
 * sb_start_write -> i_mutex -> invalidate_lock (w) -> i_mmap_rwsem (w) ->
 *   page lock
 * sb_start_write -> i_mutex -> invalidate_lock (w) -> transaction start ->
 *   i_data_sem (rw)
 *
 * direct IO:
 * sb_start_write -> i_mutex -> mmap_lock
 * sb_start_write -> i_mutex -> transaction start -> i_data_sem (rw)
 *
 * writepages:
 * transaction start -> page lock(s) -> i_data_sem (rw)
 */

static const struct fs_context_operations ext4_context_ops = {
	.parse_param	= ext4_parse_param,
	.get_tree	= ext4_get_tree,
	.reconfigure	= ext4_reconfigure,
	.free		= ext4_fc_free,
};

#if !defined(CONFIG_EXT2_FS) && !defined(CONFIG_EXT2_FS_MODULE) && defined(CONFIG_EXT4_USE_FOR_EXT2)
static struct file_system_type ext2_fs_type = {
	.owner			= THIS_MODULE,
	.name			= "ext2",
	.init_fs_context	= ext4_init_fs_context,
	.parameters		= ext4_param_specs,
	.kill_sb		= kill_block_super,
	.fs_flags		= FS_REQUIRES_DEV,
};
MODULE_ALIAS_FS("ext2");
MODULE_ALIAS("ext2");
#define IS_EXT2_SB(sb) ((sb)->s_bdev->bd_holder == &ext2_fs_type)
#else
#define IS_EXT2_SB(sb) (0)
#endif

static struct file_system_type ext3_fs_type = {
	.owner			= THIS_MODULE,
	.name			= "ext3",
	.init_fs_context	= ext4_init_fs_context,
	.parameters		= ext4_param_specs,
	.kill_sb		= kill_block_super,
	.fs_flags		= FS_REQUIRES_DEV,
};
MODULE_ALIAS_FS("ext3");
MODULE_ALIAS("ext3");
#define IS_EXT3_SB(sb) ((sb)->s_bdev->bd_holder == &ext3_fs_type)

static inline void __ext4_read_bh(struct buffer_head *bh, blk_opf_t op_flags,
				  bh_end_io_t *end_io)
{
	/*
	 * buffer's verified bit is no longer valid after reading from
	 * disk again due to write out error, clear it to make sure we
	 * recheck the buffer contents.
	 */
	clear_buffer_verified(bh);

	bh->b_end_io = end_io ? end_io : end_buffer_read_sync;
	get_bh(bh);
	submit_bh(REQ_OP_READ | op_flags, bh);
}

void ext4_read_bh_nowait(struct buffer_head *bh, blk_opf_t op_flags,
			 bh_end_io_t *end_io)
{
	BUG_ON(!buffer_locked(bh));

	if (ext4_buffer_uptodate(bh)) {
		unlock_buffer(bh);
		return;
	}
	__ext4_read_bh(bh, op_flags, end_io);
}

int ext4_read_bh(struct buffer_head *bh, blk_opf_t op_flags, bh_end_io_t *end_io)
{
	BUG_ON(!buffer_locked(bh));

	if (ext4_buffer_uptodate(bh)) {
		unlock_buffer(bh);
		return 0;
	}

	__ext4_read_bh(bh, op_flags, end_io);

	wait_on_buffer(bh);
	if (buffer_uptodate(bh))
		return 0;
	return -EIO;
}

int ext4_read_bh_lock(struct buffer_head *bh, blk_opf_t op_flags, bool wait)
{
	if (trylock_buffer(bh)) {
		if (wait)
			return ext4_read_bh(bh, op_flags, NULL);
		ext4_read_bh_nowait(bh, op_flags, NULL);
		return 0;
	}
	if (wait) {
		wait_on_buffer(bh);
		if (buffer_uptodate(bh))
			return 0;
		return -EIO;
	}
	return 0;
}

/*
 * This works like __bread_gfp() except it uses ERR_PTR for error
 * returns.  Currently with sb_bread it's impossible to distinguish
 * between ENOMEM and EIO situations (since both result in a NULL
 * return).
 */
static struct buffer_head *__ext4_sb_bread_gfp(struct super_block *sb,
					       sector_t block,
					       blk_opf_t op_flags, gfp_t gfp)
{
	struct buffer_head *bh;
	int ret;

	bh = sb_getblk_gfp(sb, block, gfp);
	if (bh == NULL)
		return ERR_PTR(-ENOMEM);
	if (ext4_buffer_uptodate(bh))
		return bh;

	ret = ext4_read_bh_lock(bh, REQ_META | op_flags, true);
	if (ret) {
		put_bh(bh);
		return ERR_PTR(ret);
	}
	return bh;
}

struct buffer_head *ext4_sb_bread(struct super_block *sb, sector_t block,
				   blk_opf_t op_flags)
{
	return __ext4_sb_bread_gfp(sb, block, op_flags, __GFP_MOVABLE);
}

struct buffer_head *ext4_sb_bread_unmovable(struct super_block *sb,
					    sector_t block)
{
	return __ext4_sb_bread_gfp(sb, block, 0, 0);
}

void ext4_sb_breadahead_unmovable(struct super_block *sb, sector_t block)
{
	struct buffer_head *bh = sb_getblk_gfp(sb, block, 0);

	if (likely(bh)) {
		ext4_read_bh_lock(bh, REQ_RAHEAD, false);
		brelse(bh);
	}
}
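The point of returning ERR_PTR() from the buffer-read helper is that the caller can tell an allocation failure from an I/O error, which a bare NULL cannot express. The following user-space sketch imitates that convention under stated assumptions: err_ptr(), ptr_err(), is_err() and read_block() are hypothetical stand-ins written for this illustration, not the kernel macros themselves.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO 4095 /* errors are encoded in the top 4095 addresses */

/* Encode a negative errno value as a pointer. */
static inline void *err_ptr(long error) { return (void *)(intptr_t)error; }

/* Decode the errno value back out of an error pointer. */
static inline long ptr_err(const void *ptr) { return (long)(intptr_t)ptr; }

/* True when the pointer lies in the reserved error range. */
static inline int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static int backing_block; /* stand-in for a real buffer */

/* Hypothetical reader: mode 0 succeeds, 1 is out of memory, 2 is an I/O error. */
static void *read_block(int mode)
{
	assert(mode >= 0 && mode <= 2);
	if (mode == 1)
		return err_ptr(-ENOMEM);
	if (mode == 2)
		return err_ptr(-EIO);
	return &backing_block;
}
```

A caller would test the result with is_err() and then branch on ptr_err(), for example retrying on -ENOMEM but reporting -EIO as a device problem.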

static int ext4_verify_csum_type(struct super_block *sb,
				 struct ext4_super_block *es)
{
	if (!ext4_has_feature_metadata_csum(sb))
		return 1;

	return es->s_checksum_type == EXT4_CRC32C_CHKSUM;
}

__le32 ext4_superblock_csum(struct super_block *sb,
			    struct ext4_super_block *es)
{
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	int offset = offsetof(struct ext4_super_block, s_checksum);
	__u32 csum;

	csum = ext4_chksum(sbi, ~0, (char *)es, offset);

	return cpu_to_le32(csum);
}

static int ext4_superblock_csum_verify(struct super_block *sb,
				       struct ext4_super_block *es)
{
	if (!ext4_has_metadata_csum(sb))
		return 1;

	return es->s_checksum == ext4_superblock_csum(sb, es);
}

void ext4_superblock_csum_set(struct super_block *sb)
{
	struct ext4_super_block *es = EXT4_SB(sb)->s_es;

	if (!ext4_has_metadata_csum(sb))
		return;

	es->s_checksum = ext4_superblock_csum(sb, es);
}
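The superblock checksum above covers every byte up to, but not including, the checksum field, which is why offsetof() is used as the length. A minimal user-space sketch of the same idea follows. It assumes a bitwise CRC32C with the seed passed in and no final inversion, uses a hypothetical toy_super structure rather than the real ext4_super_block, and ignores the little-endian conversion the kernel performs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32C (reflected polynomial 0x82F63B78), no final XOR. */
static uint32_t crc32c_raw(uint32_t crc, const void *buf, size_t len)
{
	const unsigned char *p = buf;

	assert(buf != NULL || len == 0);
	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ ((crc & 1) ? 0x82F63B78u : 0);
	}
	return crc;
}

/* Hypothetical toy superblock: payload first, checksum as the last field. */
struct toy_super {
	uint32_t magic;
	uint32_t block_count;
	uint32_t checksum;
};

static uint32_t toy_super_csum(const struct toy_super *es)
{
	/* Cover every byte that precedes the checksum field itself. */
	return crc32c_raw(~0u, es, offsetof(struct toy_super, checksum));
}

static void toy_super_csum_set(struct toy_super *es)
{
	es->checksum = toy_super_csum(es);
}

static int toy_super_csum_verify(const struct toy_super *es)
{
	return es->checksum == toy_super_csum(es);
}
```

Keeping the checksum as the final field means the covered region is one contiguous prefix, so set and verify can share a single helper.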

ext4_fsblk_t ext4_block_bitmap(struct super_block *sb,
			       struct ext4_group_desc *bg)
{
	return le32_to_cpu(bg->bg_block_bitmap_lo) |
		(EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT ?
		 (ext4_fsblk_t)le32_to_cpu(bg->bg_block_bitmap_hi) << 32 : 0);
}

ext4_fsblk_t ext4_inode_bitmap(struct super_block *sb,
			       struct ext4_group_desc *bg)
{
	return le32_to_cpu(bg->bg_inode_bitmap_lo) |
		(EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT ?
		 (ext4_fsblk_t)le32_to_cpu(bg->bg_inode_bitmap_hi) << 32 : 0);
}

ext4_fsblk_t ext4_inode_table(struct super_block *sb,
			      struct ext4_group_desc *bg)
{
	return le32_to_cpu(bg->bg_inode_table_lo) |
		(EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT ?
		 (ext4_fsblk_t)le32_to_cpu(bg->bg_inode_table_hi) << 32 : 0);
}

__u32 ext4_free_group_clusters(struct super_block *sb,
			       struct ext4_group_desc *bg)
{
	return le16_to_cpu(bg->bg_free_blocks_count_lo) |
		(EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT ?
		 (__u32)le16_to_cpu(bg->bg_free_blocks_count_hi) << 16 : 0);
}

__u32 ext4_free_inodes_count(struct super_block *sb,
			     struct ext4_group_desc *bg)
{
	return le16_to_cpu(bg->bg_free_inodes_count_lo) |
		(EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT ?
		 (__u32)le16_to_cpu(bg->bg_free_inodes_count_hi) << 16 : 0);
}

__u32 ext4_used_dirs_count(struct super_block *sb,
			   struct ext4_group_desc *bg)
{
	return le16_to_cpu(bg->bg_used_dirs_count_lo) |
		(EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT ?
		 (__u32)le16_to_cpu(bg->bg_used_dirs_count_hi) << 16 : 0);
}

__u32 ext4_itable_unused_count(struct super_block *sb,
			       struct ext4_group_desc *bg)
{
	return le16_to_cpu(bg->bg_itable_unused_lo) |
		(EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT ?
		 (__u32)le16_to_cpu(bg->bg_itable_unused_hi) << 16 : 0);
}

void ext4_block_bitmap_set(struct super_block *sb,
			   struct ext4_group_desc *bg, ext4_fsblk_t blk)
{
	bg->bg_block_bitmap_lo = cpu_to_le32((u32)blk);
	if (EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT)
		bg->bg_block_bitmap_hi = cpu_to_le32(blk >> 32);
}

void ext4_inode_bitmap_set(struct super_block *sb,
			   struct ext4_group_desc *bg, ext4_fsblk_t blk)
{
	bg->bg_inode_bitmap_lo = cpu_to_le32((u32)blk);
	if (EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT)
		bg->bg_inode_bitmap_hi = cpu_to_le32(blk >> 32);
}

void ext4_inode_table_set(struct super_block *sb,
			  struct ext4_group_desc *bg, ext4_fsblk_t blk)
{
	bg->bg_inode_table_lo = cpu_to_le32((u32)blk);
	if (EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT)
		bg->bg_inode_table_hi = cpu_to_le32(blk >> 32);
}

void ext4_free_group_clusters_set(struct super_block *sb,
				  struct ext4_group_desc *bg, __u32 count)
{
	bg->bg_free_blocks_count_lo = cpu_to_le16((__u16)count);
	if (EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT)
		bg->bg_free_blocks_count_hi = cpu_to_le16(count >> 16);
}

void ext4_free_inodes_set(struct super_block *sb,
			  struct ext4_group_desc *bg, __u32 count)
{
	bg->bg_free_inodes_count_lo = cpu_to_le16((__u16)count);
	if (EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT)
		bg->bg_free_inodes_count_hi = cpu_to_le16(count >> 16);
}

void ext4_used_dirs_set(struct super_block *sb,
			struct ext4_group_desc *bg, __u32 count)
{
	bg->bg_used_dirs_count_lo = cpu_to_le16((__u16)count);
	if (EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT)
		bg->bg_used_dirs_count_hi = cpu_to_le16(count >> 16);
}

void ext4_itable_unused_set(struct super_block *sb,
			    struct ext4_group_desc *bg, __u32 count)
{
	bg->bg_itable_unused_lo = cpu_to_le16((__u16)count);
	if (EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT)
		bg->bg_itable_unused_hi = cpu_to_le16(count >> 16);
}
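All of the group descriptor accessors above follow one pattern: the low half of a value is always stored, and the high half is only read or written when the descriptor is large enough to contain it. The user-space sketch below shows that pattern for a single 64-bit field. It is an illustration under assumptions: toy_desc and the toy_bitmap_* functions are hypothetical, the threshold of 64 is assumed to mirror EXT4_MIN_DESC_SIZE_64BIT, and byte-order conversion is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MIN_DESC_SIZE_64BIT 64 /* assumed to mirror EXT4_MIN_DESC_SIZE_64BIT */

/* Hypothetical descriptor holding one value split into two halves. */
struct toy_desc {
	uint32_t bitmap_lo;
	uint32_t bitmap_hi; /* only meaningful for 64-byte descriptors */
};

static uint64_t toy_bitmap_get(const struct toy_desc *bg, unsigned int desc_size)
{
	assert(bg != NULL);
	/* Small descriptors have no high half, so treat it as zero. */
	return bg->bitmap_lo |
	       (desc_size >= MIN_DESC_SIZE_64BIT ?
		(uint64_t)bg->bitmap_hi << 32 : 0);
}

static void toy_bitmap_set(struct toy_desc *bg, unsigned int desc_size,
			   uint64_t blk)
{
	assert(bg != NULL);
	bg->bitmap_lo = (uint32_t)blk;
	/* Never touch the high half when the descriptor cannot hold it. */
	if (desc_size >= MIN_DESC_SIZE_64BIT)
		bg->bitmap_hi = (uint32_t)(blk >> 32);
}
```

Storing 0x123456789 in a 32-byte descriptor silently keeps only 0x23456789, which is why callers must not hand out block numbers above 32 bits on filesystems without the larger descriptor.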

static void __ext4_update_tstamp(__le32 *lo, __u8 *hi, time64_t now)
{
	now = clamp_val(now, 0, (1ull << 40) - 1);

	*lo = cpu_to_le32(lower_32_bits(now));
	*hi = upper_32_bits(now);
}

static time64_t __ext4_get_tstamp(__le32 *lo, __u8 *hi)
{
	return ((time64_t)(*hi) << 32) + le32_to_cpu(*lo);
}
#define ext4_update_tstamp(es, tstamp) \
	__ext4_update_tstamp(&(es)->tstamp, &(es)->tstamp ## _hi, \
			     ktime_get_real_seconds())
#define ext4_get_tstamp(es, tstamp) \
	__ext4_get_tstamp(&(es)->tstamp, &(es)->tstamp ## _hi)
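The timestamp helpers above pack a seconds counter into 40 bits: a 32-bit low word plus one extra high byte, with out-of-range values clamped. A user-space sketch of that encoding follows. The names tstamp_store(), tstamp_load() and TSTAMP_MAX are hypothetical, and the little-endian conversion done by the kernel helpers is left out.

```c
#include <assert.h>
#include <stdint.h>

#define TSTAMP_MAX ((1ULL << 40) - 1) /* largest value that fits in 40 bits */

/* Clamp a seconds value to [0, TSTAMP_MAX] and split it into lo and hi. */
static void tstamp_store(uint32_t *lo, uint8_t *hi, int64_t now)
{
	assert(lo && hi);
	if (now < 0)
		now = 0;
	else if ((uint64_t)now > TSTAMP_MAX)
		now = (int64_t)TSTAMP_MAX;
	*lo = (uint32_t)now;                  /* low 32 bits */
	*hi = (uint8_t)((uint64_t)now >> 32); /* remaining 8 bits */
}

/* Reassemble the 40-bit value from its two parts. */
static int64_t tstamp_load(const uint32_t *lo, const uint8_t *hi)
{
	assert(lo && hi);
	return ((int64_t)*hi << 32) + *lo;
}
```

The extra byte extends the range well past the 2038 limit of a signed 32-bit seconds counter while leaving the original 32-bit field in place for older readers.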

/*
 * The del_gendisk() function uninitializes the disk-specific data
 * structures, including the bdi structure, without telling anyone
 * else.  Once this happens, any attempt to call mark_buffer_dirty()
 * (for example, by ext4_commit_super), will cause a kernel OOPS.
 * This is a kludge to prevent these oops until we can put in a proper
 * hook in del_gendisk() to inform the VFS and file system layers.
 */
static int block_device_ejected(struct super_block *sb)
{
	struct inode *bd_inode = sb->s_bdev->bd_inode;
	struct backing_dev_info *bdi = inode_to_bdi(bd_inode);

	return bdi->dev == NULL;
}

static void ext4_journal_commit_callback(journal_t *journal, transaction_t *txn)
{
	struct super_block		*sb = journal->j_private;
	struct ext4_sb_info		*sbi = EXT4_SB(sb);
	int				error = is_journal_aborted(journal);
	struct ext4_journal_cb_entry	*jce;

	BUG_ON(txn->t_state == T_FINISHED);

	ext4_process_freed_data(sb, txn->t_tid);

	spin_lock(&sbi->s_md_lock);
	while (!list_empty(&txn->t_private_list)) {
		jce = list_entry(txn->t_private_list.next,
				 struct ext4_journal_cb_entry, jce_list);
		list_del_init(&jce->jce_list);
		spin_unlock(&sbi->s_md_lock);
		jce->jce_func(sb, jce, error);
		spin_lock(&sbi->s_md_lock);
	}
	spin_unlock(&sbi->s_md_lock);
}

/*
 * This writepage callback for write_cache_pages()
 * takes care of a few cases after page cleaning.
 *
 * write_cache_pages() already checks for dirty pages
 * and calls clear_page_dirty_for_io(), which we want,
 * to write protect the pages.
 *
 * However, we may have to redirty a page (see below.)
 */
static int ext4_journalled_writepage_callback(struct page *page,
					      struct writeback_control *wbc,
					      void *data)
{
	transaction_t *transaction = (transaction_t *) data;
	struct buffer_head *bh, *head;
	struct journal_head *jh;

	bh = head = page_buffers(page);
	do {
		/*
		 * We have to redirty a page in these cases:
		 * 1) If buffer is dirty, it means the page was dirty because it
		 * contains a buffer that needs checkpointing. So the dirty bit
		 * needs to be preserved so that checkpointing writes the buffer
		 * properly.
		 * 2) If buffer is not part of the committing transaction
		 * (we may have just accidentally come across this buffer because
		 * inode range tracking is not exact) or if the currently running
		 * transaction already contains this buffer as well, dirty bit
		 * needs to be preserved so that the buffer gets writeprotected
		 * properly on running transaction's commit.
		 */
		jh = bh2jh(bh);
		if (buffer_dirty(bh) ||
		    (jh && (jh->b_transaction != transaction ||
			    jh->b_next_transaction))) {
			redirty_page_for_writepage(wbc, page);
			goto out;
		}
	} while ((bh = bh->b_this_page) != head);

out:
	return AOP_WRITEPAGE_ACTIVATE;
}

static int ext4_journalled_submit_inode_data_buffers(struct jbd2_inode *jinode)
{
	struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
	struct writeback_control wbc = {
		.sync_mode	= WB_SYNC_ALL,
		.nr_to_write	= LONG_MAX,
		.range_start	= jinode->i_dirty_start,
		.range_end	= jinode->i_dirty_end,
	};

	return write_cache_pages(mapping, &wbc,
				 ext4_journalled_writepage_callback,
				 jinode->i_transaction);
}

static int ext4_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
{
	int ret;

	if (ext4_should_journal_data(jinode->i_vfs_inode))
		ret = ext4_journalled_submit_inode_data_buffers(jinode);
	else
		ret = jbd2_journal_submit_inode_data_buffers(jinode);

	return ret;
}

static int ext4_journal_finish_inode_data_buffers(struct jbd2_inode *jinode)
{
	int ret = 0;

	if (!ext4_should_journal_data(jinode->i_vfs_inode))
		ret = jbd2_journal_finish_inode_data_buffers(jinode);

	return ret;
}
2019-03-15 06:46:05 +03:00
static bool system_going_down ( void )
{
return system_state = = SYSTEM_HALT | | system_state = = SYSTEM_POWER_OFF
| | system_state = = SYSTEM_RESTART ;
}
2020-11-27 14:33:59 +03:00
struct ext4_err_translation {
int code ;
int errno ;
} ;
# define EXT4_ERR_TRANSLATE(err) { .code = EXT4_ERR_##err, .errno = err }
static struct ext4_err_translation err_translation [ ] = {
EXT4_ERR_TRANSLATE ( EIO ) ,
EXT4_ERR_TRANSLATE ( ENOMEM ) ,
EXT4_ERR_TRANSLATE ( EFSBADCRC ) ,
EXT4_ERR_TRANSLATE ( EFSCORRUPTED ) ,
EXT4_ERR_TRANSLATE ( ENOSPC ) ,
EXT4_ERR_TRANSLATE ( ENOKEY ) ,
EXT4_ERR_TRANSLATE ( EROFS ) ,
EXT4_ERR_TRANSLATE ( EFBIG ) ,
EXT4_ERR_TRANSLATE ( EEXIST ) ,
EXT4_ERR_TRANSLATE ( ERANGE ) ,
EXT4_ERR_TRANSLATE ( EOVERFLOW ) ,
EXT4_ERR_TRANSLATE ( EBUSY ) ,
EXT4_ERR_TRANSLATE ( ENOTDIR ) ,
EXT4_ERR_TRANSLATE ( ENOTEMPTY ) ,
EXT4_ERR_TRANSLATE ( ESHUTDOWN ) ,
EXT4_ERR_TRANSLATE ( EFAULT ) ,
} ;
static int ext4_errno_to_code ( int errno )
{
int i ;
for ( i = 0 ; i < ARRAY_SIZE ( err_translation ) ; i + + )
if ( err_translation [ i ] . errno = = errno )
return err_translation [ i ] . code ;
return EXT4_ERR_UNKNOWN ;
}
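The table-driven translation above can be sketched in user space as follows. This is only an illustration under stated assumptions: the `demo_*` and `DEMO_ERR_*` names and their numeric values are hypothetical (the real `EXT4_ERR_*` codes are defined elsewhere), and the struct member is called `err` because `errno` is a macro in user-space C.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical compact codes; values are made up for this sketch. */
enum demo_err_code {
	DEMO_ERR_UNKNOWN = 1,
	DEMO_ERR_EIO,
	DEMO_ERR_ENOMEM,
	DEMO_ERR_ENOSPC,
};

struct demo_err_translation {
	int code;	/* compact code that would be stored on disk */
	int err;	/* positive host errno value */
};

/* One argument yields both names: DEMO_ERR_<e> via token pasting, and <e> itself. */
#define DEMO_ERR_TRANSLATE(e) { .code = DEMO_ERR_##e, .err = e }

static const struct demo_err_translation demo_table[] = {
	DEMO_ERR_TRANSLATE(EIO),
	DEMO_ERR_TRANSLATE(ENOMEM),
	DEMO_ERR_TRANSLATE(ENOSPC),
};

/* Linear scan of the table; anything not listed maps to DEMO_ERR_UNKNOWN. */
static int demo_errno_to_code(int err)
{
	size_t i;

	assert(err >= 0);	/* the sketch expects positive errno values */
	for (i = 0; i < sizeof(demo_table) / sizeof(demo_table[0]); i++)
		if (demo_table[i].err == err)
			return demo_table[i].code;
	return DEMO_ERR_UNKNOWN;
}
```

Adding a new translation is then a one-line change to the table, and the lookup function never needs to be touched.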
2020-12-16 13:18:40 +03:00
static void save_error_info ( struct super_block * sb , int error ,
__u32 ino , __u64 block ,
const char * func , unsigned int line )
2020-11-27 14:33:58 +03:00
{
2020-11-27 14:34:00 +03:00
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
2020-11-27 14:33:58 +03:00
2020-11-27 14:33:59 +03:00
/* We default to EFSCORRUPTED error... */
if ( error = = 0 )
error = EFSCORRUPTED ;
2020-11-27 14:34:00 +03:00
spin_lock ( & sbi - > s_error_lock ) ;
sbi - > s_add_error_count + + ;
sbi - > s_last_error_code = error ;
sbi - > s_last_error_line = line ;
sbi - > s_last_error_ino = ino ;
sbi - > s_last_error_block = block ;
sbi - > s_last_error_func = func ;
sbi - > s_last_error_time = ktime_get_real_seconds ( ) ;
if ( ! sbi - > s_first_error_time ) {
sbi - > s_first_error_code = error ;
sbi - > s_first_error_line = line ;
sbi - > s_first_error_ino = ino ;
sbi - > s_first_error_block = block ;
sbi - > s_first_error_func = func ;
sbi - > s_first_error_time = sbi - > s_last_error_time ;
}
spin_unlock ( & sbi - > s_error_lock ) ;
2020-11-27 14:33:58 +03:00
}
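The first/last bookkeeping in save_error_info() can be sketched in user space as follows. This is a minimal illustration, not the kernel code: `struct demo_err_log`, `demo_save_error()`, `demo_example()` and `DEMO_EFSCORRUPTED` are hypothetical names, the timestamp is passed in by the caller instead of being read from the clock, and the spinlock is left out because the sketch is single-threaded.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the default error code used by the sketch. */
#define DEMO_EFSCORRUPTED 117

/* Hypothetical, trimmed-down set of the per-filesystem error fields. */
struct demo_err_log {
	unsigned int count;
	int first_code;
	long long first_time;
	int last_code;
	long long last_time;
};

/*
 * Record one error. The "last" slot is always overwritten; the "first"
 * slot is filled only while first_time is still zero.
 */
static void demo_save_error(struct demo_err_log *log, int error, long long now)
{
	assert(log != NULL);
	if (error == 0)
		error = DEMO_EFSCORRUPTED;	/* default when the caller has no code */
	log->count++;
	log->last_code = error;
	log->last_time = now;
	if (!log->first_time) {
		log->first_code = error;
		log->first_time = now;
	}
}

/* Usage: the first call fills both slots, the second overwrites only "last". */
static struct demo_err_log demo_example(void)
{
	struct demo_err_log log = { 0 };

	demo_save_error(&log, 0, 100);	/* stored as DEMO_EFSCORRUPTED */
	demo_save_error(&log, 5, 200);
	return log;
}
```

Note that a zero timestamp doubles as the "no first error yet" marker, so the sketch, like the code above, assumes real timestamps are never zero.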
2006-10-11 12:20:50 +04:00
/* Deal with the reporting of failure conditions on a filesystem such as
* inconsistencies detected or read IO failures .
*
* On ext2 , we can store the error state of the filesystem in the
2006-10-11 12:20:53 +04:00
* superblock . That is not possible on ext4 , because we may have other
2006-10-11 12:20:50 +04:00
* write ordering constraints on the superblock which prevent us from
* writing it out straight away ; and given that the journal is about to
* be aborted , we can ' t rely on the current , or future , transactions to
* write out the superblock safely .
*
2006-10-11 12:21:01 +04:00
* We ' ll just use the jbd2_journal_abort ( ) error code to record an error in
2010-01-18 00:10:07 +03:00
* the journal instead . On recovery , the journal will complain about
2006-10-11 12:20:50 +04:00
* that error until we ' ve noted it down and cleared it .
2020-11-27 14:33:57 +03:00
*
* If force_ro is set , we unconditionally force the filesystem into an
* ABORT | READONLY state , unless the error response on the fs has been set to
* panic in which case we take the easy way out and panic immediately . This is
* used to deal with unrecoverable failures such as journal IO errors or ENOMEM
* at a critical moment in log management .
2006-10-11 12:20:50 +04:00
*/
2020-12-16 13:18:37 +03:00
static void ext4_handle_error ( struct super_block * sb , bool force_ro , int error ,
__u32 ino , __u64 block ,
const char * func , unsigned int line )
2006-10-11 12:20:50 +04:00
{
2020-11-27 14:33:54 +03:00
journal_t * journal = EXT4_SB ( sb ) - > s_journal ;
2020-12-16 13:18:40 +03:00
bool continue_fs = ! force_ro & & test_opt ( sb , ERRORS_CONT ) ;
2020-11-27 14:33:54 +03:00
2020-12-16 13:18:37 +03:00
EXT4_SB ( sb ) - > s_mount_state | = EXT4_ERROR_FS ;
2018-06-13 06:34:57 +03:00
if ( test_opt ( sb , WARN_ON_ERROR ) )
WARN_ON_ONCE ( 1 ) ;
2020-12-16 13:18:40 +03:00
if ( ! continue_fs & & ! sb_rdonly ( sb ) ) {
ext4_set_mount_flag ( sb , EXT4_MF_FS_ABORTED ) ;
if ( journal )
jbd2_journal_abort ( journal , - EIO ) ;
}
if ( ! bdev_read_only ( sb - > s_bdev ) ) {
2020-12-16 13:18:37 +03:00
save_error_info ( sb , error , ino , block , func , line ) ;
2020-12-16 13:18:40 +03:00
/*
* In case the fs should keep running , we need to write out the
* superblock through the journal . Due to lock ordering
* constraints , it may not be safe to do it right here , so we
* defer superblock flushing to a workqueue .
*/
2021-09-24 12:39:17 +03:00
if ( continue_fs & & journal )
2020-12-16 13:18:40 +03:00
schedule_work ( & EXT4_SB ( sb ) - > s_error_work ) ;
else
ext4_commit_super ( sb ) ;
}
2020-12-16 13:18:37 +03:00
2019-03-15 06:46:05 +03:00
/*
* We force ERRORS_RO behavior when the system is rebooting . Otherwise
* we could panic during ' reboot - f ' because the underlying device has
* already been disabled .
*/
2020-11-27 14:33:57 +03:00
if ( test_opt ( sb , ERRORS_PANIC ) & & ! system_going_down ( ) ) {
2006-10-11 12:20:53 +04:00
panic ( " EXT4-fs (device %s): panic forced after error \n " ,
2006-10-11 12:20:50 +04:00
sb - > s_id ) ;
2015-10-19 00:02:56 +03:00
}
ext4: always panic when errors=panic is specified
Before commit 014c9caa29d3 ("ext4: make ext4_abort() use
__ext4_error()"), the following series of commands would trigger a
panic:
1. mount /dev/sda -o ro,errors=panic test
2. mount /dev/sda -o remount,abort test
After commit 014c9caa29d3, remounting a file system using the test
mount option "abort" will no longer trigger a panic. This commit will
restore the behaviour immediately before commit 014c9caa29d3.
(However, note that the Linux kernel's behavior has not been
consistent; some previous kernel versions, including 5.4 and 4.19
similarly did not panic after using the mount option "abort".)
This also makes a change to long-standing behaviour; namely, the
following series of commands will now cause a panic, when previously it
did not:
1. mount /dev/sda -o ro,errors=panic test
2. echo test > /sys/fs/ext4/sda/trigger_fs_error
However, this makes ext4's behaviour much more consistent, so this is
a good thing.
Cc: stable@kernel.org
Fixes: 014c9caa29d3 ("ext4: make ext4_abort() use __ext4_error()")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Link: https://lore.kernel.org/r/20210401081903.3421208-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-01 11:19:03 +03:00
if ( sb_rdonly ( sb ) | | continue_fs )
return ;
2020-11-27 14:33:57 +03:00
ext4_msg ( sb , KERN_CRIT , " Remounting filesystem read-only " ) ;
/*
* Make sure updated value of - > s_mount_flags will be visible before
* - > s_flags update
*/
smp_wmb ( ) ;
sb - > s_flags | = SB_RDONLY ;
2006-10-11 12:20:50 +04:00
}
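The branch structure of ext4_handle_error() can be summarized as a pure decision function. The sketch below is an assumption-laden model, not kernel code: `struct demo_policy`, `struct demo_outcome` and `demo_handle_error()` are hypothetical names, and saving the error info and writing the superblock are deliberately left out so that only the continue / panic / remount-read-only decision remains.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the inputs the decision depends on. */
struct demo_policy {
	bool errors_cont;	/* mounted with errors=continue */
	bool errors_panic;	/* mounted with errors=panic */
	bool rdonly;		/* filesystem is already read-only */
	bool going_down;	/* halt, power-off or restart in progress */
};

/* Hypothetical summary of what the handler would do. */
struct demo_outcome {
	bool mark_aborted;	/* set the aborted flag (and abort the journal) */
	bool panic;		/* panic the machine */
	bool remount_ro;	/* flip the filesystem to read-only */
};

static struct demo_outcome demo_handle_error(const struct demo_policy *p,
					     bool force_ro)
{
	struct demo_outcome out = { false, false, false };
	bool continue_fs;

	assert(p != NULL);
	/* force_ro overrides errors=continue. */
	continue_fs = !force_ro && p->errors_cont;

	if (!continue_fs && !p->rdonly)
		out.mark_aborted = true;

	/* errors=panic wins unless the system is already shutting down. */
	if (p->errors_panic && !p->going_down) {
		out.panic = true;
		return out;	/* a real panic would not return */
	}

	if (p->rdonly || continue_fs)
		return out;

	out.remount_ro = true;
	return out;
}
```

Modelling the outcome as data makes it easy to see that errors=panic also applies to a filesystem that is already read-only, which is the behaviour the commit message below describes.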
2020-11-27 14:34:00 +03:00
static void flush_stashed_error_work ( struct work_struct * work )
{
struct ext4_sb_info * sbi = container_of ( work , struct ext4_sb_info ,
s_error_work ) ;
2020-12-16 13:18:40 +03:00
journal_t * journal = sbi - > s_journal ;
handle_t * handle ;
2020-11-27 14:34:00 +03:00
2020-12-16 13:18:40 +03:00
/*
* If the journal is still running , we have to write out the superblock
* through the journal to avoid collisions with other journalled sb
* updates .
*
* We use jbd2 functions directly here to avoid recursing back into
* ext4 error handling code while handling previous errors .
*/
if ( ! sb_rdonly ( sbi - > s_sb ) & & journal ) {
2021-06-15 12:05:37 +03:00
struct buffer_head * sbh = sbi - > s_sbh ;
2020-12-16 13:18:40 +03:00
handle = jbd2_journal_start ( journal , 1 ) ;
if ( IS_ERR ( handle ) )
goto write_directly ;
2021-06-15 12:05:37 +03:00
if ( jbd2_journal_get_write_access ( handle , sbh ) ) {
2020-12-16 13:18:40 +03:00
jbd2_journal_stop ( handle ) ;
goto write_directly ;
}
ext4_update_super ( sbi - > s_sb ) ;
2021-06-15 12:05:37 +03:00
if ( buffer_write_io_error ( sbh ) | | ! buffer_uptodate ( sbh ) ) {
ext4_msg ( sbi - > s_sb , KERN_ERR , " previous I/O error to "
" superblock detected " ) ;
clear_buffer_write_io_error ( sbh ) ;
set_buffer_uptodate ( sbh ) ;
}
if ( jbd2_journal_dirty_metadata ( handle , sbh ) ) {
2020-12-16 13:18:40 +03:00
jbd2_journal_stop ( handle ) ;
goto write_directly ;
}
jbd2_journal_stop ( handle ) ;
2021-06-11 17:02:08 +03:00
ext4_notify_error_sysfs ( sbi ) ;
2020-12-16 13:18:40 +03:00
return ;
}
write_directly :
/*
* Write through journal failed . Write sb directly to get error info
* out and hope for the best .
*/
2020-12-16 13:18:38 +03:00
ext4_commit_super ( sbi - > s_sb ) ;
2021-06-11 17:02:08 +03:00
ext4_notify_error_sysfs ( sbi ) ;
2006-10-11 12:20:50 +04:00
}
2013-10-18 05:11:01 +04:00
# define ext4_error_ratelimit(sb) \
___ratelimit ( & ( EXT4_SB ( sb ) - > s_err_ratelimit_state ) , \
" EXT4-fs error " )
2010-02-15 22:19:27 +03:00
void __ext4_error ( struct super_block * sb , const char * function ,
2020-11-27 14:33:57 +03:00
unsigned int line , bool force_ro , int error , __u64 block ,
2020-03-29 02:33:43 +03:00
const char * fmt , . . . )
2006-10-11 12:20:50 +04:00
{
2010-12-20 06:43:19 +03:00
struct va_format vaf ;
2006-10-11 12:20:50 +04:00
va_list args ;
2017-02-05 09:28:48 +03:00
if ( unlikely ( ext4_forced_shutdown ( EXT4_SB ( sb ) ) ) )
return ;
2018-02-19 04:53:23 +03:00
trace_ext4_error ( sb , function , line ) ;
2013-10-18 05:11:01 +04:00
if ( ext4_error_ratelimit ( sb ) ) {
va_start ( args , fmt ) ;
vaf . fmt = fmt ;
vaf . va = & args ;
printk ( KERN_CRIT
" EXT4-fs error (device %s): %s:%d: comm %s: %pV \n " ,
sb - > s_id , function , line , current - > comm , & vaf ) ;
va_end ( args ) ;
}
2021-10-25 22:27:44 +03:00
fsnotify_sb_error ( sb , NULL , error ? error : EFSCORRUPTED ) ;
2020-12-16 13:18:37 +03:00
ext4_handle_error ( sb , force_ro , error , 0 , block , function , line ) ;
2006-10-11 12:20:50 +04:00
}
2013-07-01 16:12:37 +04:00
void __ext4_error_inode ( struct inode * inode , const char * function ,
2020-03-29 02:33:43 +03:00
unsigned int line , ext4_fsblk_t block , int error ,
2013-07-01 16:12:37 +04:00
const char * fmt , . . . )
2010-03-02 19:46:09 +03:00
{
va_list args ;
2011-01-10 20:10:55 +03:00
struct va_format vaf ;
2010-03-02 19:46:09 +03:00
2017-02-05 09:28:48 +03:00
if ( unlikely ( ext4_forced_shutdown ( EXT4_SB ( inode - > i_sb ) ) ) )
return ;
2018-02-19 04:53:23 +03:00
trace_ext4_error ( inode - > i_sb , function , line ) ;
2013-10-18 05:11:01 +04:00
if ( ext4_error_ratelimit ( inode - > i_sb ) ) {
va_start ( args , fmt ) ;
vaf . fmt = fmt ;
vaf . va = & args ;
if ( block )
printk ( KERN_CRIT " EXT4-fs error (device %s): %s:%d: "
" inode #%lu: block %llu: comm %s: %pV \n " ,
inode - > i_sb - > s_id , function , line , inode - > i_ino ,
block , current - > comm , & vaf ) ;
else
printk ( KERN_CRIT " EXT4-fs error (device %s): %s:%d: "
" inode #%lu: comm %s: %pV \n " ,
inode - > i_sb - > s_id , function , line , inode - > i_ino ,
current - > comm , & vaf ) ;
va_end ( args ) ;
}
2021-10-25 22:27:44 +03:00
fsnotify_sb_error ( inode - > i_sb , inode , error ? error : EFSCORRUPTED ) ;
2020-12-16 13:18:37 +03:00
ext4_handle_error ( inode - > i_sb , false , error , inode - > i_ino , block ,
function , line ) ;
2010-03-02 19:46:09 +03:00
}
2013-07-01 16:12:37 +04:00
void __ext4_error_file ( struct file * file , const char * function ,
unsigned int line , ext4_fsblk_t block ,
const char * fmt , . . . )
2010-03-02 19:46:09 +03:00
{
va_list args ;
2011-01-10 20:10:55 +03:00
struct va_format vaf ;
2013-01-24 02:07:38 +04:00
struct inode * inode = file_inode ( file ) ;
2010-03-02 19:46:09 +03:00
char pathname [ 80 ] , * path ;
2017-02-05 09:28:48 +03:00
if ( unlikely ( ext4_forced_shutdown ( EXT4_SB ( inode - > i_sb ) ) ) )
return ;
2018-02-19 04:53:23 +03:00
trace_ext4_error ( inode - > i_sb , function , line ) ;
2013-10-18 05:11:01 +04:00
if ( ext4_error_ratelimit ( inode - > i_sb ) ) {
2015-06-19 11:29:13 +03:00
path = file_path ( file , pathname , sizeof ( pathname ) ) ;
2013-10-18 05:11:01 +04:00
if ( IS_ERR ( path ) )
path = " (unknown) " ;
va_start ( args , fmt ) ;
vaf . fmt = fmt ;
vaf . va = & args ;
if ( block )
printk ( KERN_CRIT
" EXT4-fs error (device %s): %s:%d: inode #%lu: "
" block %llu: comm %s: path %s: %pV \n " ,
inode - > i_sb - > s_id , function , line , inode - > i_ino ,
block , current - > comm , path , & vaf ) ;
else
printk ( KERN_CRIT
" EXT4-fs error (device %s): %s:%d: inode #%lu: "
" comm %s: path %s: %pV \n " ,
inode - > i_sb - > s_id , function , line , inode - > i_ino ,
current - > comm , path , & vaf ) ;
va_end ( args ) ;
}
2021-10-25 22:27:44 +03:00
fsnotify_sb_error ( inode - > i_sb , inode , EFSCORRUPTED ) ;
2020-12-16 13:18:37 +03:00
ext4_handle_error ( inode - > i_sb , false , EFSCORRUPTED , inode - > i_ino , block ,
function , line ) ;
2010-03-02 19:46:09 +03:00
}
2013-02-08 22:00:31 +04:00
const char * ext4_decode_error ( struct super_block * sb , int errno ,
char nbuf [ 16 ] )
2006-10-11 12:20:50 +04:00
{
char * errstr = NULL ;
switch ( errno ) {
2015-10-17 23:16:04 +03:00
case - EFSCORRUPTED :
errstr = " Corrupt filesystem " ;
break ;
case - EFSBADCRC :
errstr = " Filesystem failed CRC " ;
break ;
2006-10-11 12:20:50 +04:00
case - EIO :
errstr = " IO failure " ;
break ;
case - ENOMEM :
errstr = " Out of memory " ;
break ;
case - EROFS :
2009-07-28 07:09:47 +04:00
if ( ! sb | | ( EXT4_SB ( sb ) - > s_journal & &
EXT4_SB ( sb ) - > s_journal - > j_flags & JBD2_ABORT ) )
2006-10-11 12:20:50 +04:00
errstr = " Journal has aborted " ;
else
errstr = " Readonly filesystem " ;
break ;
default :
/* If the caller passed in an extra buffer for unknown
* errors , textualise them now . Else we just return
* NULL . */
if ( nbuf ) {
/* Check for truncated error codes... */
if ( snprintf ( nbuf , 16 , " error %d " , - errno ) > = 0 )
errstr = nbuf ;
}
break ;
}
return errstr ;
}
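The decode-with-fallback pattern used by ext4_decode_error() can be sketched in user space as follows. The names are hypothetical and the journal-dependent EROFS case is omitted. Unlike the code above, this sketch also returns NULL when the formatted text would not fit in the 16-byte buffer; that extra check is an addition of the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Map a negative errno to a fixed string. Unknown codes are formatted
 * into nbuf when the caller supplies one; otherwise NULL is returned.
 */
static const char *demo_decode_error(int err, char nbuf[16])
{
	int n;

	switch (err) {
	case -EIO:
		return "IO failure";
	case -ENOMEM:
		return "Out of memory";
	default:
		if (!nbuf)
			return NULL;
		n = snprintf(nbuf, 16, "error %d", -err);
		/* Reject both output errors and truncated text. */
		if (n < 0 || n >= 16)
			return NULL;
		return nbuf;
	}
}
```

Because the fallback buffer belongs to the caller, the function stays reentrant: two threads decoding different unknown codes never share storage.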
2006-10-11 12:20:53 +04:00
/* __ext4_std_error decodes expected errors from journaling functions
2006-10-11 12:20:50 +04:00
* automatically and invokes the appropriate error response . */
2010-07-27 19:56:40 +04:00
void __ext4_std_error ( struct super_block * sb , const char * function ,
unsigned int line , int errno )
2006-10-11 12:20:50 +04:00
{
char nbuf [ 16 ] ;
const char * errstr ;
2017-02-05 09:28:48 +03:00
if ( unlikely ( ext4_forced_shutdown ( EXT4_SB ( sb ) ) ) )
return ;
2006-10-11 12:20:50 +04:00
/* Special case: if the error is EROFS, and we're not already
* inside a transaction , then there ' s really no point in logging
* an error . */
2017-07-17 10:45:34 +03:00
if ( errno = = - EROFS & & journal_current_handle ( ) = = NULL & & sb_rdonly ( sb ) )
2006-10-11 12:20:50 +04:00
return ;
2013-10-18 05:11:01 +04:00
if ( ext4_error_ratelimit ( sb ) ) {
errstr = ext4_decode_error ( sb , errno , nbuf ) ;
printk ( KERN_CRIT " EXT4-fs error (device %s) in %s:%d: %s \n " ,
sb - > s_id , function , line , errstr ) ;
}
2021-10-25 22:27:44 +03:00
fsnotify_sb_error ( sb , NULL , errno ? errno : EFSCORRUPTED ) ;
2006-10-11 12:20:50 +04:00
2020-12-16 13:18:37 +03:00
ext4_handle_error ( sb , false , - errno , 0 , 0 , function , line ) ;
2006-10-11 12:20:50 +04:00
}
2013-07-01 16:12:37 +04:00
void __ext4_msg ( struct super_block * sb ,
const char * prefix , const char * fmt , . . . )
2009-06-05 01:36:36 +04:00
{
2010-12-20 06:43:19 +03:00
struct va_format vaf ;
2009-06-05 01:36:36 +04:00
va_list args ;
2021-10-27 17:18:49 +03:00
if ( sb ) {
atomic_inc ( & EXT4_SB ( sb ) - > s_msg_count ) ;
if ( ! ___ratelimit ( & ( EXT4_SB ( sb ) - > s_msg_ratelimit_state ) ,
" EXT4-fs " ) )
return ;
}
2013-10-18 05:11:01 +04:00
2009-06-05 01:36:36 +04:00
va_start ( args , fmt ) ;
2010-12-20 06:43:19 +03:00
vaf . fmt = fmt ;
vaf . va = & args ;
2021-10-27 17:18:49 +03:00
if ( sb )
printk ( " %sEXT4-fs (%s): %pV \n " , prefix , sb - > s_id , & vaf ) ;
else
printk ( " %sEXT4-fs: %pV \n " , prefix , & vaf ) ;
2009-06-05 01:36:36 +04:00
va_end ( args ) ;
}
2020-07-25 15:33:13 +03:00
static int ext4_warning_ratelimit ( struct super_block * sb )
{
atomic_inc ( & EXT4_SB ( sb ) - > s_warning_count ) ;
return ___ratelimit ( & ( EXT4_SB ( sb ) - > s_warning_ratelimit_state ) ,
" EXT4-fs warning " ) ;
}
2015-06-15 21:50:26 +03:00
2010-02-15 22:19:27 +03:00
void __ext4_warning ( struct super_block * sb , const char * function ,
2010-07-27 19:56:40 +04:00
unsigned int line , const char * fmt , . . . )
2006-10-11 12:20:50 +04:00
{
2010-12-20 06:43:19 +03:00
struct va_format vaf ;
2006-10-11 12:20:50 +04:00
va_list args ;
2015-06-15 21:50:26 +03:00
if ( ! ext4_warning_ratelimit ( sb ) )
2013-10-18 05:11:01 +04:00
return ;
2006-10-11 12:20:50 +04:00
va_start ( args , fmt ) ;
2010-12-20 06:43:19 +03:00
vaf . fmt = fmt ;
vaf . va = & args ;
printk ( KERN_WARNING " EXT4-fs warning (device %s): %s:%d: %pV \n " ,
sb - > s_id , function , line , & vaf ) ;
2006-10-11 12:20:50 +04:00
va_end ( args ) ;
}
2015-06-15 21:50:26 +03:00
void __ext4_warning_inode ( const struct inode * inode , const char * function ,
unsigned int line , const char * fmt , . . . )
{
struct va_format vaf ;
va_list args ;
if ( ! ext4_warning_ratelimit ( inode - > i_sb ) )
return ;
va_start ( args , fmt ) ;
vaf . fmt = fmt ;
vaf . va = & args ;
printk ( KERN_WARNING " EXT4-fs warning (device %s): %s:%d: "
" inode #%lu: comm %s: %pV \n " , inode - > i_sb - > s_id ,
function , line , inode - > i_ino , current - > comm , & vaf ) ;
va_end ( args ) ;
}
2010-06-29 20:54:28 +04:00
void __ext4_grp_locked_error ( const char * function , unsigned int line ,
struct super_block * sb , ext4_group_t grp ,
unsigned long ino , ext4_fsblk_t block ,
const char * fmt , . . . )
2009-01-06 06:19:52 +03:00
__releases ( bitlock )
__acquires ( bitlock )
{
2010-12-20 06:43:19 +03:00
struct va_format vaf ;
2009-01-06 06:19:52 +03:00
va_list args ;
2017-02-05 09:28:48 +03:00
if ( unlikely ( ext4_forced_shutdown ( EXT4_SB ( sb ) ) ) )
return ;
2018-02-19 04:53:23 +03:00
trace_ext4_error ( sb , function , line ) ;
2013-10-18 05:11:01 +04:00
if ( ext4_error_ratelimit ( sb ) ) {
va_start ( args , fmt ) ;
vaf . fmt = fmt ;
vaf . va = & args ;
printk ( KERN_CRIT " EXT4-fs error (device %s): %s:%d: group %u, " ,
sb - > s_id , function , line , grp ) ;
if ( ino )
printk ( KERN_CONT " inode %lu: " , ino ) ;
if ( block )
printk ( KERN_CONT " block %llu: " ,
( unsigned long long ) block ) ;
printk ( KERN_CONT " %pV \n " , & vaf ) ;
va_end ( args ) ;
}
2009-01-06 06:19:52 +03:00
if ( test_opt ( sb , ERRORS_CONT ) ) {
2020-11-27 14:34:00 +03:00
if ( test_opt ( sb , WARN_ON_ERROR ) )
WARN_ON_ONCE ( 1 ) ;
2020-12-16 13:18:37 +03:00
EXT4_SB ( sb ) - > s_mount_state | = EXT4_ERROR_FS ;
2020-12-16 13:18:40 +03:00
if ( ! bdev_read_only ( sb - > s_bdev ) ) {
save_error_info ( sb , EFSCORRUPTED , ino , block , function ,
line ) ;
2020-12-16 13:18:37 +03:00
schedule_work ( & EXT4_SB ( sb ) - > s_error_work ) ;
2020-12-16 13:18:40 +03:00
}
2009-01-06 06:19:52 +03:00
return ;
}
ext4_unlock_group ( sb , grp ) ;
2020-12-16 13:18:37 +03:00
ext4_handle_error ( sb , false , EFSCORRUPTED , ino , block , function , line ) ;
2009-01-06 06:19:52 +03:00
/*
* We only get here in the ERRORS_RO case ; relocking the group
* may be dangerous , but nothing bad will happen since the
* filesystem will have already been marked read / only and the
* journal has been aborted . We return 1 as a hint to callers
* who might want to use the return value from
2011-03-31 05:57:33 +04:00
* ext4_grp_locked_error ( ) to distinguish between the
2009-01-06 06:19:52 +03:00
* ERRORS_CONT and ERRORS_RO case , and perhaps return more
* aggressively from the ext4 function in question , with a
* more appropriate error code .
*/
ext4_lock_group ( sb , grp ) ;
return ;
}
2018-05-12 18:39:40 +03:00
void ext4_mark_group_bitmap_corrupted ( struct super_block * sb ,
ext4_group_t group ,
unsigned int flags )
{
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
struct ext4_group_info * grp = ext4_get_group_info ( sb , group ) ;
struct ext4_group_desc * gdp = ext4_get_group_desc ( sb , group , NULL ) ;
2018-07-30 00:27:45 +03:00
int ret ;
if ( flags & EXT4_GROUP_INFO_BBITMAP_CORRUPT ) {
ret = ext4_test_and_set_bit ( EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT ,
& grp - > bb_state ) ;
if ( ! ret )
percpu_counter_sub ( & sbi - > s_freeclusters_counter ,
grp - > bb_free ) ;
2018-05-12 18:39:40 +03:00
}
2018-07-30 00:27:45 +03:00
if ( flags & EXT4_GROUP_INFO_IBITMAP_CORRUPT ) {
ret = ext4_test_and_set_bit ( EXT4_GROUP_INFO_IBITMAP_CORRUPT_BIT ,
& grp - > bb_state ) ;
if ( ! ret & & gdp ) {
2018-05-12 18:39:40 +03:00
int count ;
count = ext4_free_inodes_count ( sb , gdp ) ;
percpu_counter_sub ( & sbi - > s_freeinodes_counter ,
count ) ;
}
}
}
2006-10-11 12:20:53 +04:00
void ext4_update_dynamic_rev ( struct super_block * sb )
2006-10-11 12:20:50 +04:00
{
2006-10-11 12:20:53 +04:00
struct ext4_super_block * es = EXT4_SB ( sb ) - > s_es ;
2006-10-11 12:20:50 +04:00
2006-10-11 12:20:53 +04:00
if ( le32_to_cpu ( es - > s_rev_level ) > EXT4_GOOD_OLD_REV )
2006-10-11 12:20:50 +04:00
return ;
2010-02-15 22:19:27 +03:00
ext4_warning ( sb ,
2006-10-11 12:20:50 +04:00
" updating to rev %d because of new feature flag, "
" running e2fsck is recommended " ,
2006-10-11 12:20:53 +04:00
EXT4_DYNAMIC_REV ) ;
2006-10-11 12:20:50 +04:00
2006-10-11 12:20:53 +04:00
es - > s_first_ino = cpu_to_le32 ( EXT4_GOOD_OLD_FIRST_INO ) ;
es - > s_inode_size = cpu_to_le16 ( EXT4_GOOD_OLD_INODE_SIZE ) ;
es - > s_rev_level = cpu_to_le32 ( EXT4_DYNAMIC_REV ) ;
2006-10-11 12:20:50 +04:00
/* leave es->s_feature_*compat flags alone */
/* es->s_uuid will be set by e2fsck if empty */
/*
* The rest of the superblock fields should be zero , and if not it
* means they are likely already in use , so leave them alone . We
* can leave it up to e2fsck to clean up any inconsistencies there .
*/
}
/*
* Open the external journal device
*/
2009-06-05 01:36:36 +04:00
static struct block_device * ext4_blkdev_get ( dev_t dev , struct super_block * sb )
2006-10-11 12:20:50 +04:00
{
struct block_device * bdev ;
2010-11-13 13:55:18 +03:00
bdev = blkdev_get_by_dev ( dev , FMODE_READ | FMODE_WRITE | FMODE_EXCL , sb ) ;
2006-10-11 12:20:50 +04:00
if ( IS_ERR ( bdev ) )
goto fail ;
return bdev ;
fail :
2020-03-24 10:25:11 +03:00
ext4_msg ( sb , KERN_ERR ,
" failed to open journal device unknown-block(%u,%u) %ld " ,
MAJOR ( dev ) , MINOR ( dev ) , PTR_ERR ( bdev ) ) ;
2006-10-11 12:20:50 +04:00
return NULL ;
}
/*
* Release the journal device
*/
2013-05-06 06:11:03 +04:00
static void ext4_blkdev_put ( struct block_device * bdev )
2006-10-11 12:20:50 +04:00
{
2013-05-06 06:11:03 +04:00
blkdev_put ( bdev , FMODE_READ | FMODE_WRITE | FMODE_EXCL ) ;
2006-10-11 12:20:50 +04:00
}
2013-05-06 06:11:03 +04:00
static void ext4_blkdev_remove ( struct ext4_sb_info * sbi )
2006-10-11 12:20:50 +04:00
{
struct block_device * bdev ;
2020-09-24 06:03:42 +03:00
bdev = sbi - > s_journal_bdev ;
2006-10-11 12:20:50 +04:00
if ( bdev ) {
2013-05-06 06:11:03 +04:00
ext4_blkdev_put ( bdev ) ;
2020-09-24 06:03:42 +03:00
sbi - > s_journal_bdev = NULL ;
2006-10-11 12:20:50 +04:00
}
}
static inline struct inode * orphan_list_entry ( struct list_head * l )
{
2006-10-11 12:20:53 +04:00
return & list_entry ( l , struct ext4_inode_info , i_orphan ) - > vfs_inode ;
2006-10-11 12:20:50 +04:00
}
2006-10-11 12:20:53 +04:00
static void dump_orphan_list ( struct super_block * sb , struct ext4_sb_info * sbi )
2006-10-11 12:20:50 +04:00
{
struct list_head * l ;
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " sb orphan head is %d " ,
le32_to_cpu ( sbi - > s_es - > s_last_orphan ) ) ;
2006-10-11 12:20:50 +04:00
printk ( KERN_ERR " sb_info orphan list: \n " ) ;
list_for_each ( l , & sbi - > s_orphan ) {
struct inode * inode = orphan_list_entry ( l ) ;
printk ( KERN_ERR " "
" inode %s:%lu at %p: mode %o, nlink %d, next %d \n " ,
inode - > i_sb - > s_id , inode - > i_ino , inode ,
inode - > i_mode , inode - > i_nlink ,
NEXT_ORPHAN ( inode ) ) ;
}
}
2017-04-06 16:40:06 +03:00
# ifdef CONFIG_QUOTA
static int ext4_quota_off ( struct super_block * sb , int type ) ;
static inline void ext4_quota_off_umount ( struct super_block * sb )
{
int type ;
2017-05-22 05:31:23 +03:00
/* Use our quota_off function to clear inode flags etc. */
for ( type = 0 ; type < EXT4_MAXQUOTAS ; type + + )
ext4_quota_off ( sb , type ) ;
2017-04-06 16:40:06 +03:00
}
2018-10-12 16:28:09 +03:00
/*
* This is a helper function which is used in the mount / remount
* codepaths ( which hold s_umount ) to fetch the quota file name .
*/
static inline char * get_qf_name ( struct super_block * sb ,
struct ext4_sb_info * sbi ,
int type )
{
return rcu_dereference_protected ( sbi - > s_qf_names [ type ] ,
lockdep_is_held ( & sb - > s_umount ) ) ;
}
2017-04-06 16:40:06 +03:00
# else
static inline void ext4_quota_off_umount ( struct super_block * sb )
{
}
# endif
2008-07-27 00:15:44 +04:00
static void ext4_put_super ( struct super_block * sb )
2006-10-11 12:20:50 +04:00
{
2006-10-11 12:20:53 +04:00
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
struct ext4_super_block * es = sbi - > s_es ;
2020-02-16 00:40:37 +03:00
struct buffer_head * * group_desc ;
2020-02-19 06:08:51 +03:00
struct flex_groups * * flex_groups ;
2017-02-05 07:38:06 +03:00
int aborted = 0 ;
2008-10-28 05:53:05 +03:00
int i , err ;
2006-10-11 12:20:50 +04:00
2020-03-18 09:13:01 +03:00
/*
* Unregister sysfs before destroying the jbd2 journal , since the
* attr_journal_task attribute could otherwise still be read via the
* sysfs path while sbi - > s_journal - > j_task is NULL .
2022-03-22 04:24:19 +03:00
* Also unregister sysfs before flushing sbi - > s_error_work . A user
* may read / proc / fs / ext4 / xx / mb_groups during umount ; if metadata
* verification fails on that read , error work is queued , and
* flush_stashed_error_work ( ) then calls start_this_handle ( ) , which
* may trigger a BUG_ON .
2020-03-18 09:13:01 +03:00
*/
ext4_unregister_sysfs ( sb ) ;
2022-04-12 17:53:20 +03:00
if ( ___ratelimit ( & ext4_mount_msg_ratelimit , " EXT4-fs unmount " ) )
ext4_msg ( sb , KERN_INFO , " unmounting filesystem. " ) ;
2022-03-22 04:24:19 +03:00
ext4_unregister_li_request ( sb ) ;
ext4_quota_off_umount ( sb ) ;
flush_work ( & sbi - > s_error_work ) ;
destroy_workqueue ( sbi - > rsv_conversion_wq ) ;
ext4_release_orphan_info ( sb ) ;
2009-01-07 08:06:22 +03:00
if ( sbi - > s_journal ) {
2017-02-05 07:38:06 +03:00
aborted = is_journal_aborted ( sbi - > s_journal ) ;
2009-01-07 08:06:22 +03:00
err = jbd2_journal_destroy ( sbi - > s_journal ) ;
sbi - > s_journal = NULL ;
2019-11-20 05:54:15 +03:00
if ( ( err < 0 ) & & ! aborted ) {
2020-03-29 02:33:43 +03:00
ext4_abort ( sb , - err , " Couldn't clean up the journal " ) ;
2019-11-20 05:54:15 +03:00
}
2009-01-07 08:06:22 +03:00
}
2009-12-09 05:48:58 +03:00
2013-07-01 16:12:37 +04:00
ext4_es_unregister_shrinker ( sbi ) ;
2013-12-09 05:52:31 +04:00
del_timer_sync ( & sbi - > s_err_report ) ;
2009-12-09 05:48:58 +03:00
ext4_release_system_zone ( sb ) ;
ext4_mb_release ( sb ) ;
ext4_ext_release ( sb ) ;
2017-07-17 10:45:34 +03:00
if ( ! sb_rdonly ( sb ) & & ! aborted ) {
2015-10-17 23:18:43 +03:00
ext4_clear_feature_journal_needs_recovery ( sb ) ;
2021-08-16 12:57:06 +03:00
ext4_clear_feature_orphan_present ( sb ) ;
2006-10-11 12:20:50 +04:00
es - > s_state = cpu_to_le16 ( sbi - > s_mount_state ) ;
}
2017-07-17 10:45:34 +03:00
if ( ! sb_rdonly ( sb ) )
2020-12-16 13:18:38 +03:00
ext4_commit_super ( sb ) ;
2012-03-22 06:29:15 +04:00
2020-02-16 00:40:37 +03:00
rcu_read_lock ( ) ;
group_desc = rcu_dereference ( sbi - > s_group_desc ) ;
2006-10-11 12:20:50 +04:00
for ( i = 0 ; i < sbi - > s_gdb_count ; i + + )
2020-02-16 00:40:37 +03:00
brelse ( group_desc [ i ] ) ;
kvfree ( group_desc ) ;
2020-02-19 06:08:51 +03:00
flex_groups = rcu_dereference ( sbi - > s_flex_groups ) ;
if ( flex_groups ) {
for ( i = 0 ; i < sbi - > s_flex_groups_allocated ; i + + )
kvfree ( flex_groups [ i ] ) ;
kvfree ( flex_groups ) ;
}
2020-02-16 00:40:37 +03:00
rcu_read_unlock ( ) ;
2011-09-10 02:56:51 +04:00
percpu_counter_destroy ( & sbi - > s_freeclusters_counter ) ;
2006-10-11 12:20:50 +04:00
percpu_counter_destroy ( & sbi - > s_freeinodes_counter ) ;
percpu_counter_destroy ( & sbi - > s_dirs_counter ) ;
2011-09-10 02:56:51 +04:00
percpu_counter_destroy ( & sbi - > s_dirtyclusters_counter ) ;
2021-02-18 18:11:32 +03:00
percpu_counter_destroy ( & sbi - > s_sra_exceeded_retry_limit ) ;
2020-02-19 21:30:46 +03:00
percpu_free_rwsem ( & sbi - > s_writepages_rwsem ) ;
2006-10-11 12:20:50 +04:00
# ifdef CONFIG_QUOTA
2014-09-11 19:15:15 +04:00
for ( i = 0 ; i < EXT4_MAXQUOTAS ; i + + )
2018-10-12 16:28:09 +03:00
kfree ( get_qf_name ( sb , sbi , i ) ) ;
2006-10-11 12:20:50 +04:00
# endif
/* Debugging code just in case the in-memory inode orphan list
* isn ' t empty . The on - disk one can be non - empty if we ' ve
* detected an error and taken the fs readonly , but the
* in - memory list had better be clean by this point . */
if ( ! list_empty ( & sbi - > s_orphan ) )
dump_orphan_list ( sb , sbi ) ;
2020-11-07 18:58:11 +03:00
ASSERT ( list_empty ( & sbi - > s_orphan ) ) ;
2006-10-11 12:20:50 +04:00
2015-06-21 05:50:33 +03:00
sync_blockdev ( sb - > s_bdev ) ;
2007-05-07 01:49:54 +04:00
invalidate_bdev ( sb - > s_bdev ) ;
2020-09-24 06:03:42 +03:00
if ( sbi - > s_journal_bdev & & sbi - > s_journal_bdev ! = sb - > s_bdev ) {
2006-10-11 12:20:50 +04:00
/*
* Invalidate the journal device ' s buffers . We don ' t want them
* floating about in memory - the physical journal device may
* be hotswapped , and it breaks the ` ro - after ' testing code .
*/
2020-09-24 06:03:42 +03:00
sync_blockdev ( sbi - > s_journal_bdev ) ;
invalidate_bdev ( sbi - > s_journal_bdev ) ;
2006-10-11 12:20:53 +04:00
ext4_blkdev_remove ( sbi ) ;
2006-10-11 12:20:50 +04:00
}
2018-12-04 08:24:42 +03:00
ext4_xattr_destroy_cache ( sbi - > s_ea_inode_cache ) ;
sbi - > s_ea_inode_cache = NULL ;
ext4_xattr_destroy_cache ( sbi - > s_ea_block_cache ) ;
sbi - > s_ea_block_cache = NULL ;
2021-04-30 21:50:46 +03:00
ext4_stop_mmpd ( sbi ) ;
2016-11-26 22:24:51 +03:00
brelse ( sbi - > s_sbh ) ;
2006-10-11 12:20:50 +04:00
sb - > s_fs_info = NULL ;
2009-03-31 17:10:09 +04:00
/*
* Now that we are completely done shutting down the
* superblock , we need to actually destroy the kobject .
*/
kobject_put ( & sbi - > s_kobj ) ;
wait_for_completion ( & sbi - > s_kobj_unregister ) ;
2012-04-30 02:27:10 +04:00
if ( sbi - > s_chksum_driver )
crypto_free_shash ( sbi - > s_chksum_driver ) ;
2009-02-16 02:07:52 +03:00
kfree ( sbi - > s_blockgroup_lock ) ;
dax: introduce holder for dax_device
Patch series "v14 fsdax-rmap + v11 fsdax-reflink", v2.
The patchset fsdax-rmap is aimed to support shared pages tracking for
fsdax.
It moves owner tracking from dax_associate_entry() to pmem device driver,
by introducing an interface ->memory_failure() for struct pagemap. This
interface is called by memory_failure() in mm, and implemented by pmem
device.
Then call holder operations to find the filesystem which the corrupted
data located in, and call filesystem handler to track files or metadata
associated with this page.
Finally we are able to try to fix the corrupted data in filesystem and do
other necessary processing, such as killing processes who are using the
files affected.
The call trace is like this:
memory_failure()
|* fsdax case
|------------
|pgmap->ops->memory_failure() => pmem_pgmap_memory_failure()
| dax_holder_notify_failure() =>
| dax_device->holder_ops->notify_failure() =>
| - xfs_dax_notify_failure()
| |* xfs_dax_notify_failure()
| |--------------------------
| | xfs_rmap_query_range()
| | xfs_dax_failure_fn()
| | * corrupted on metadata
| | try to recover data, call xfs_force_shutdown()
| | * corrupted on file data
| | try to recover data, call mf_dax_kill_procs()
|* normal case
|-------------
|mf_generic_kill_procs()
The patchset fsdax-reflink attempts to add CoW support for fsdax, and
takes XFS, which has both reflink and fsdax features, as an example.
One of the key mechanisms that fsdax needs is CoW: copy
the data from srcmap before actually writing data to the destination
iomap, and copy only the range in which the data won't be changed.
Another mechanism is range comparison. In the page cache case, readpage() is
used to load data from disk into the page cache so that it can be
compared. In the fsdax case, readpage() does not work, so we need another
way to compare data, one with direct access support.
With the two mechanisms implemented in fsdax, we are able to make reflink
and fsdax work together in XFS.
This patch (of 14):
To easily track filesystem from a pmem device, we introduce a holder for
dax_device structure, and also its operation. This holder is used to
remember who is using this dax_device:
- When it is the backend of a filesystem, the holder will be the
instance of this filesystem.
- When this pmem device is one of the targets in a mapped device, the
holder will be this mapped device. In this case, the mapped device
has its own dax_device and it will follow the first rule, so that we
can ultimately track down the filesystem we need.
The holder and holder_ops will be set when a filesystem is being mounted,
or a target device is being activated.
Link: https://lkml.kernel.org/r/20220603053738.1218681-1-ruansy.fnst@fujitsu.com
Link: https://lkml.kernel.org/r/20220603053738.1218681-2-ruansy.fnst@fujitsu.com
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.wiliams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
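The holder idea described in this commit message can be sketched in plain user-space C. This is only an illustration under stated assumptions: every `demo_*` name below is hypothetical and is not the kernel's dax API. It shows a device that remembers its holder plus a table of holder operations, and a notify helper that forwards a failure to the holder or reports that nobody is registered.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct demo_dax_device;

/* Hypothetical callback table that a holder registers with the device. */
struct demo_holder_ops {
	int (*notify_failure)(struct demo_dax_device *dev,
			      uint64_t off, uint64_t len);
};

/* Hypothetical device: remembers who holds it and how to reach them. */
struct demo_dax_device {
	void *holder;
	const struct demo_holder_ops *holder_ops;
};

/* Hypothetical filesystem instance that records the failures reported to it. */
struct demo_fs {
	unsigned int failures_seen;
	uint64_t last_off;
};

/* The holder's handler: look up the owning filesystem and record the event. */
static int demo_fs_notify_failure(struct demo_dax_device *dev,
				  uint64_t off, uint64_t len)
{
	struct demo_fs *fs = dev->holder;

	(void)len;
	fs->failures_seen++;
	fs->last_off = off;
	return 0;
}

static const struct demo_holder_ops demo_fs_holder_ops = {
	.notify_failure = demo_fs_notify_failure,
};

/* In this sketch the holder is registered when the filesystem is mounted. */
static void demo_set_holder(struct demo_dax_device *dev, void *holder,
			    const struct demo_holder_ops *ops)
{
	assert(dev != NULL);
	dev->holder = holder;
	dev->holder_ops = ops;
}

/* Forward a failure to the holder, or report that no holder is listening. */
static int demo_holder_notify_failure(struct demo_dax_device *dev,
				      uint64_t off, uint64_t len)
{
	if (dev->holder_ops == NULL || dev->holder_ops->notify_failure == NULL)
		return -EOPNOTSUPP;
	return dev->holder_ops->notify_failure(dev, off, len);
}
```

A caller would register the filesystem with `demo_set_holder(&dev, &fs, &demo_fs_holder_ops)` and later call `demo_holder_notify_failure(&dev, off, len)` from the device side. Keeping the holder as an opaque pointer lets either a filesystem or a mapped device occupy the same slot.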
2022-06-03 08:37:25 +03:00
fs_put_dax ( sbi - > s_daxdev , NULL ) ;
fscrypt: handle test_dummy_encryption in more logical way
The behavior of the test_dummy_encryption mount option is that when a
new file (or directory or symlink) is created in an unencrypted
directory, it's automatically encrypted using a dummy encryption policy.
That's it; in particular, the encryption (or lack thereof) of existing
files (or directories or symlinks) doesn't change.
Unfortunately the implementation of test_dummy_encryption is a bit weird
and confusing. When test_dummy_encryption is enabled and a file is
being created in an unencrypted directory, we set up an encryption key
(->i_crypt_info) for the directory. This isn't actually used to do any
encryption, however, since the directory is still unencrypted! Instead,
->i_crypt_info is only used for inheriting the encryption policy.
One consequence of this is that the filesystem ends up providing a
"dummy context" (policy + nonce) instead of a "dummy policy". In
commit ed318a6cc0b6 ("fscrypt: support test_dummy_encryption=v2"), I
mistakenly thought this was required. However, actually the nonce only
ends up being used to derive a key that is never used.
Another consequence of this implementation is that it allows for
'inode->i_crypt_info != NULL && !IS_ENCRYPTED(inode)', which is an edge
case that can be forgotten about. For example, currently
FS_IOC_GET_ENCRYPTION_POLICY on an unencrypted directory may return the
dummy encryption policy when the filesystem is mounted with
test_dummy_encryption. That seems like the wrong thing to do, since
again, the directory itself is not actually encrypted.
Therefore, switch to a more logical and maintainable implementation
where the dummy encryption policy inheritance is done without setting up
keys for unencrypted directories. This involves:
- Adding a function fscrypt_policy_to_inherit() which returns the
encryption policy to inherit from a directory. This can be a real
policy, a dummy policy, or no policy.
- Replacing struct fscrypt_dummy_context, ->get_dummy_context(), etc.
with struct fscrypt_dummy_policy, ->get_dummy_policy(), etc.
- Making fscrypt_fname_encrypted_size() take an fscrypt_policy instead
of an inode.
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Acked-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20200917041136.178600-13-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
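The three outcomes listed for fscrypt_policy_to_inherit() (a real policy, a dummy policy, or no policy) can be illustrated with a small user-space sketch. The types and the `demo_*` names are assumptions made for this example, not the fscrypt data structures; the point is only that the decision needs no key setup on the unencrypted directory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result: which kind of policy a new file should inherit. */
enum demo_policy_kind {
	DEMO_POLICY_NONE,
	DEMO_POLICY_REAL,
	DEMO_POLICY_DUMMY,
};

/* Hypothetical superblock: true when mounted with test_dummy_encryption. */
struct demo_sb {
	bool has_dummy_policy;
};

/* Hypothetical directory: knows whether it is encrypted and which sb owns it. */
struct demo_dir {
	bool encrypted;
	const struct demo_sb *sb;
};

/*
 * Decide what a file created in 'dir' inherits:
 *  - an encrypted directory hands down its own (real) policy;
 *  - an unencrypted directory on a test_dummy_encryption mount hands
 *    down the dummy policy;
 *  - otherwise the new file stays unencrypted.
 */
static enum demo_policy_kind demo_policy_to_inherit(const struct demo_dir *dir)
{
	assert(dir != NULL && dir->sb != NULL);

	if (dir->encrypted)
		return DEMO_POLICY_REAL;
	if (dir->sb->has_dummy_policy)
		return DEMO_POLICY_DUMMY;
	return DEMO_POLICY_NONE;
}
```

Because the directory's own state is never modified, an unencrypted directory stays free of key material, which is the edge case the commit message sets out to remove.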
2020-09-17 07:11:35 +03:00
fscrypt_free_dummy_policy ( & sbi - > s_dummy_enc_policy ) ;
2022-01-18 09:56:14 +03:00
# if IS_ENABLED(CONFIG_UNICODE)
2020-10-28 08:08:20 +03:00
utf8_unload ( sb - > s_encoding ) ;
2019-04-25 21:05:42 +03:00
# endif
2006-10-11 12:20:50 +04:00
kfree ( sbi ) ;
}
2006-12-07 07:33:20 +03:00
static struct kmem_cache * ext4_inode_cachep ;
2006-10-11 12:20:50 +04:00
/*
* Called inside transaction , so use GFP_NOFS
*/
2006-10-11 12:20:53 +04:00
static struct inode * ext4_alloc_inode ( struct super_block * sb )
2006-10-11 12:20:50 +04:00
{
2006-10-11 12:20:53 +04:00
struct ext4_inode_info * ei ;
2006-10-11 12:20:50 +04:00
2022-03-23 00:41:03 +03:00
ei = alloc_inode_sb ( sb , ext4_inode_cachep , GFP_NOFS ) ;
2006-10-11 12:20:50 +04:00
if ( ! ei )
return NULL ;
2009-06-04 01:59:28 +04:00
2018-01-09 16:21:39 +03:00
inode_set_iversion ( & ei - > vfs_inode , 1 ) ;
2014-04-21 22:37:55 +04:00
spin_lock_init ( & ei - > i_raw_lock ) ;
2008-01-29 08:19:52 +03:00
INIT_LIST_HEAD ( & ei - > i_prealloc_list ) ;
2020-08-17 10:36:15 +03:00
atomic_set ( & ei - > i_prealloc_active , 0 ) ;
2008-01-29 08:19:52 +03:00
spin_lock_init ( & ei - > i_prealloc_lock ) ;
2012-11-09 06:57:30 +04:00
ext4_es_init_tree ( & ei - > i_es_tree ) ;
rwlock_init ( & ei - > i_es_lock ) ;
2014-11-25 19:45:37 +03:00
INIT_LIST_HEAD ( & ei - > i_es_list ) ;
2014-09-02 06:26:49 +04:00
ei - > i_es_all_nr = 0 ;
2014-11-25 19:45:37 +03:00
ei - > i_es_shk_nr = 0 ;
2014-11-25 19:51:23 +03:00
ei - > i_es_shrink_lblk = 0 ;
2008-07-15 01:52:37 +04:00
ei - > i_reserved_data_blocks = 0 ;
spin_lock_init ( & ( ei - > i_block_reservation_lock ) ) ;
2018-10-01 21:17:41 +03:00
ext4_init_pending_tree ( & ei - > i_pending_tree ) ;
2009-12-14 15:21:14 +03:00
# ifdef CONFIG_QUOTA
ei - > i_reserved_quota = 0 ;
2014-09-29 16:58:25 +04:00
memset ( & ei - > i_dquot , 0 , sizeof ( ei - > i_dquot ) ) ;
2009-12-14 15:21:14 +03:00
# endif
2011-01-10 20:29:43 +03:00
ei - > jinode = NULL ;
2013-06-04 22:21:02 +04:00
INIT_LIST_HEAD ( & ei - > i_rsv_conversion_list ) ;
2010-03-05 00:14:02 +03:00
spin_lock_init ( & ei - > i_completed_io_lock ) ;
2009-12-09 07:51:10 +03:00
ei - > i_sync_tid = 0 ;
ei - > i_datasync_tid = 0 ;
2012-09-29 07:24:52 +04:00
atomic_set ( & ei - > i_unwritten , 0 ) ;
2013-06-04 22:21:02 +04:00
INIT_WORK ( & ei - > i_rsv_conversion_work , ext4_end_io_rsv_work ) ;
2020-10-15 23:37:57 +03:00
ext4_fc_init_inode ( & ei - > vfs_inode ) ;
mutex_init ( & ei - > i_fc_lock ) ;
2006-10-11 12:20:50 +04:00
return & ei - > vfs_inode ;
}
2010-11-08 21:51:33 +03:00
static int ext4_drop_inode ( struct inode * inode )
{
int drop = generic_drop_inode ( inode ) ;
2019-08-05 05:35:48 +03:00
if ( ! drop )
drop = fscrypt_drop_inode ( inode ) ;
2010-11-08 21:51:33 +03:00
trace_ext4_drop_inode ( inode , drop ) ;
return drop ;
}
2019-04-16 02:28:34 +03:00
static void ext4_free_in_core_inode ( struct inode * inode )
2011-01-07 09:49:49 +03:00
{
2019-04-10 23:21:15 +03:00
fscrypt_free_inode ( inode ) ;
2020-10-15 23:37:57 +03:00
if ( ! list_empty ( & ( EXT4_I ( inode ) - > i_fc_list ) ) ) {
pr_warn ( " %s: inode %ld still in fc list " ,
__func__ , inode - > i_ino ) ;
}
2011-01-07 09:49:49 +03:00
kmem_cache_free ( ext4_inode_cachep , EXT4_I ( inode ) ) ;
}
2006-10-11 12:20:53 +04:00
static void ext4_destroy_inode ( struct inode * inode )
2006-10-11 12:20:50 +04:00
{
2007-07-16 10:40:45 +04:00
if ( ! list_empty ( & ( EXT4_I ( inode ) - > i_orphan ) ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( inode - > i_sb , KERN_ERR ,
" Inode %lu (%p): orphan list check failed! " ,
inode - > i_ino , EXT4_I ( inode ) ) ;
2007-07-16 10:40:45 +04:00
print_hex_dump ( KERN_INFO , " " , DUMP_PREFIX_ADDRESS , 16 , 4 ,
EXT4_I ( inode ) , sizeof ( struct ext4_inode_info ) ,
true ) ;
dump_stack ( ) ;
}
2021-08-23 09:13:58 +03:00
if ( EXT4_I ( inode ) - > i_reserved_data_blocks )
ext4_msg ( inode - > i_sb , KERN_ERR ,
" Inode %lu (%p): i_reserved_data_blocks (%u) not cleared! " ,
inode - > i_ino , EXT4_I ( inode ) ,
EXT4_I ( inode ) - > i_reserved_data_blocks ) ;
2006-10-11 12:20:50 +04:00
}
2008-07-26 06:45:34 +04:00
static void init_once ( void * foo )
2006-10-11 12:20:50 +04:00
{
2022-04-01 11:13:21 +03:00
struct ext4_inode_info * ei = foo ;
2006-10-11 12:20:50 +04:00
2007-05-17 09:10:57 +04:00
INIT_LIST_HEAD ( & ei - > i_orphan ) ;
init_rwsem ( & ei - > xattr_sem ) ;
2008-01-29 07:58:26 +03:00
init_rwsem ( & ei - > i_data_sem ) ;
2007-05-17 09:10:57 +04:00
inode_init_once ( & ei - > vfs_inode ) ;
2020-10-15 23:37:57 +03:00
ext4_fc_init_inode ( & ei - > vfs_inode ) ;
2006-10-11 12:20:50 +04:00
}
2014-02-18 05:34:53 +04:00
static int __init init_inodecache ( void )
2006-10-11 12:20:50 +04:00
{
ext4: Define usercopy region in ext4_inode_cache slab cache
The ext4 symlink pathnames, stored in struct ext4_inode_info.i_data
and therefore contained in the ext4_inode_cache slab cache, need
to be copied to/from userspace.
cache object allocation:
fs/ext4/super.c:
ext4_alloc_inode(...):
struct ext4_inode_info *ei;
...
ei = kmem_cache_alloc(ext4_inode_cachep, GFP_NOFS);
...
return &ei->vfs_inode;
include/trace/events/ext4.h:
#define EXT4_I(inode) \
(container_of(inode, struct ext4_inode_info, vfs_inode))
fs/ext4/namei.c:
ext4_symlink(...):
...
inode->i_link = (char *)&EXT4_I(inode)->i_data;
example usage trace:
readlink_copy+0x43/0x70
vfs_readlink+0x62/0x110
SyS_readlinkat+0x100/0x130
fs/namei.c:
readlink_copy(..., link):
...
copy_to_user(..., link, len)
(inlined into vfs_readlink)
generic_readlink(dentry, ...):
struct inode *inode = d_inode(dentry);
const char *link = inode->i_link;
...
readlink_copy(..., link);
In support of usercopy hardening, this patch defines a region in the
ext4_inode_cache slab cache in which userspace copy operations are
allowed.
This region is known as the slab cache's usercopy region. Slab caches
can now check that each dynamically sized copy operation involving
cache-managed memory falls entirely within the slab's usercopy region.
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on my
understanding of the code. Changes or omissions from the original code are
mine and don't reflect the original grsecurity/PaX code.
Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, provide usage trace]
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: linux-ext4@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
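The usercopy region described above is just an (offset, size) window inside each cache object. A minimal user-space sketch of the bounds check follows, assuming a stand-in structure: `demo_inode_info`, `demo_usercopy_region` and `demo_copy_allowed` are hypothetical names chosen for illustration and are not the slab allocator's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ext4_inode_info; only the layout idea matters. */
struct demo_inode_info {
	unsigned int i_data[15];	/* fast symlink target is stored here */
	unsigned long i_flags;		/* outside the whitelisted window */
	long vfs_inode;			/* placeholder for the embedded struct inode */
};

/* Hypothetical description of a cache's whitelisted window. */
struct demo_usercopy_region {
	size_t offset;
	size_t size;
};

/* Same two values the cache is created with: offset and size of i_data. */
static const struct demo_usercopy_region demo_region = {
	.offset = offsetof(struct demo_inode_info, i_data),
	.size = sizeof(((struct demo_inode_info *)0)->i_data),
};

/* True when the byte range [off, off + len) lies entirely inside the region. */
static bool demo_copy_allowed(const struct demo_usercopy_region *r,
			      size_t off, size_t len)
{
	assert(r != NULL);

	if (off < r->offset)
		return false;
	if (len > r->size)
		return false;
	/* Written as a subtraction so that off + len cannot overflow. */
	return off - r->offset <= r->size - len;
}
```

A copy of a symlink target that starts inside `i_data` and stays within it passes the check; a copy that starts at `i_flags`, or runs one byte past the end of `i_data`, is rejected.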
2017-06-11 05:50:36 +03:00
ext4_inode_cachep = kmem_cache_create_usercopy ( " ext4_inode_cache " ,
sizeof ( struct ext4_inode_info ) , 0 ,
( SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD |
SLAB_ACCOUNT ) ,
offsetof ( struct ext4_inode_info , i_data ) ,
sizeof_field ( struct ext4_inode_info , i_data ) ,
init_once ) ;
2006-10-11 12:20:53 +04:00
if ( ext4_inode_cachep = = NULL )
2006-10-11 12:20:50 +04:00
return - ENOMEM ;
return 0 ;
}
static void destroy_inodecache ( void )
{
2012-09-26 05:33:07 +04:00
/*
* Make sure all delayed rcu free inodes are flushed before we
* destroy cache .
*/
rcu_barrier ( ) ;
2006-10-11 12:20:53 +04:00
kmem_cache_destroy ( ext4_inode_cachep ) ;
2006-10-11 12:20:50 +04:00
}
2010-06-07 21:16:22 +04:00
void ext4_clear_inode ( struct inode * inode )
2006-10-11 12:20:50 +04:00
{
2020-10-15 23:37:57 +03:00
ext4_fc_del ( inode ) ;
2010-06-07 21:16:22 +04:00
invalidate_inode_buffers ( inode ) ;
2012-05-03 16:48:02 +04:00
clear_inode ( inode ) ;
2020-08-17 10:36:15 +03:00
ext4_discard_preallocations ( inode , 0 ) ;
2012-11-09 06:57:32 +04:00
ext4_es_remove_extent ( inode , 0 , EXT_MAX_BLOCKS ) ;
2019-11-08 14:45:11 +03:00
dquot_drop ( inode ) ;
2011-01-10 20:29:43 +03:00
if ( EXT4_I ( inode ) - > jinode ) {
jbd2_journal_release_jbd_inode ( EXT4_JOURNAL ( inode ) ,
EXT4_I ( inode ) - > jinode ) ;
jbd2_free_inode ( EXT4_I ( inode ) - > jinode ) ;
EXT4_I ( inode ) - > jinode = NULL ;
}
2018-01-12 07:30:13 +03:00
fscrypt_put_encryption_info ( inode ) ;
2019-07-22 19:26:24 +03:00
fsverity_cleanup_inode ( inode ) ;
2006-10-11 12:20:50 +04:00
}
2007-10-22 03:42:08 +04:00
static struct inode * ext4_nfs_get_inode ( struct super_block * sb ,
2009-06-04 01:59:28 +04:00
u64 ino , u32 generation )
2006-10-11 12:20:50 +04:00
{
struct inode * inode ;
ext4: avoid declaring fs inconsistent due to invalid file handles
If we receive a file handle, either from NFS or open_by_handle_at(2),
and it points at an inode which has not been initialized, and the file
system has metadata checksums enabled, we shouldn't try to get the
inode, discover the checksum is invalid, and then declare the file
system as being inconsistent.
This can be reproduced by creating a test file system via "mke2fs -t
ext4 -O metadata_csum /tmp/foo.img 8M", mounting it, cd'ing into that
directory, and then running the following program.
#define _GNU_SOURCE
#include <fcntl.h>
struct handle {
struct file_handle fh;
unsigned char fid[MAX_HANDLE_SZ];
};
int main(int argc, char **argv)
{
struct handle h = {{8, 1 }, { 12, }};
open_by_handle_at(AT_FDCWD, &h.fh, O_RDONLY);
return 0;
}
Google-Bug-Id: 120690101
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
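The intended behaviour for handle lookups can be sketched as a small decision function: a handle that points at an uninitialized inode, or whose generation does not match, is treated as stale rather than as evidence of corruption. This is a simplified user-space model with hypothetical `demo_*` names; it is not how ext4_iget() is implemented.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory view of the inode a file handle points at. */
struct demo_inode {
	bool initialized;	/* false: the on-disk inode was never set up */
	uint32_t generation;
};

/*
 * Validate a handle against the inode it names.
 * Returns 0 when the handle is acceptable and -ESTALE otherwise.
 * A requested generation of 0 means "accept any generation".
 */
static int demo_check_handle(const struct demo_inode *inode,
			     uint32_t generation)
{
	assert(inode != NULL);

	/* A bogus handle is the caller's problem, not a filesystem error. */
	if (!inode->initialized)
		return -ESTALE;
	if (generation != 0 && inode->generation != generation)
		return -ESTALE;
	return 0;
}
```

With this shape, a forged handle such as the one in the reproducer only makes the lookup fail with ESTALE; nothing is reported as filesystem inconsistency.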
2018-12-19 20:29:13 +03:00
/*
2006-10-11 12:20:50 +04:00
* Currently we don ' t know the generation for parent directory , so
* a generation of 0 means " accept any "
*/
2018-12-19 20:29:13 +03:00
inode = ext4_iget ( sb , ino , EXT4_IGET_HANDLE ) ;
2008-02-07 11:15:37 +03:00
if ( IS_ERR ( inode ) )
return ERR_CAST ( inode ) ;
if ( generation & & inode - > i_generation ! = generation ) {
2006-10-11 12:20:50 +04:00
iput ( inode ) ;
return ERR_PTR ( - ESTALE ) ;
}
2007-10-22 03:42:08 +04:00
return inode ;
}
static struct dentry * ext4_fh_to_dentry ( struct super_block * sb , struct fid * fid ,
2009-06-04 01:59:28 +04:00
int fh_len , int fh_type )
2007-10-22 03:42:08 +04:00
{
return generic_fh_to_dentry ( sb , fid , fh_len , fh_type ,
ext4_nfs_get_inode ) ;
}
static struct dentry * ext4_fh_to_parent ( struct super_block * sb , struct fid * fid ,
2009-06-04 01:59:28 +04:00
int fh_len , int fh_type )
2007-10-22 03:42:08 +04:00
{
return generic_fh_to_parent ( sb , fid , fh_len , fh_type ,
ext4_nfs_get_inode ) ;
2006-10-11 12:20:50 +04:00
}
2018-12-19 22:07:58 +03:00
static int ext4_nfs_commit_metadata ( struct inode * inode )
{
struct writeback_control wbc = {
. sync_mode = WB_SYNC_ALL
} ;
trace_ext4_nfs_commit_metadata ( inode ) ;
return ext4_write_inode ( inode , & wbc ) ;
}
2006-10-11 12:20:50 +04:00
# ifdef CONFIG_QUOTA
2017-04-30 06:47:50 +03:00
static const char * const quotatypes [ ] = INITQFNAMES ;
2016-01-09 00:01:22 +03:00
# define QTYPE2NAME(t) (quotatypes[t])
2006-10-11 12:20:50 +04:00
2006-10-11 12:20:53 +04:00
static int ext4_write_dquot ( struct dquot * dquot ) ;
static int ext4_acquire_dquot ( struct dquot * dquot ) ;
static int ext4_release_dquot ( struct dquot * dquot ) ;
static int ext4_mark_dquot_dirty ( struct dquot * dquot ) ;
static int ext4_write_info ( struct super_block * sb , int type ) ;
2008-04-28 13:14:34 +04:00
static int ext4_quota_on ( struct super_block * sb , int type , int format_id ,
2016-11-21 03:49:34 +03:00
const struct path * path ) ;
2006-10-11 12:20:53 +04:00
static ssize_t ext4_quota_read ( struct super_block * sb , int type , char * data ,
2006-10-11 12:20:50 +04:00
size_t len , loff_t off ) ;
2006-10-11 12:20:53 +04:00
static ssize_t ext4_quota_write ( struct super_block * sb , int type ,
2006-10-11 12:20:50 +04:00
const char * data , size_t len , loff_t off ) ;
ext4: make quota as first class supported feature
This patch adds support for quotas as a first class feature in ext4;
which is to say, the quota files are stored in hidden inodes as file
system metadata, instead of as separate files visible in the file system
directory hierarchy.
It is based on the proposal at:
https://ext4.wiki.kernel.org/index.php/Design_For_1st_Class_Quota_in_Ext4
This patch introduces a new feature - EXT4_FEATURE_RO_COMPAT_QUOTA
which, when turned on, enables quota accounting at mount time
itself. Also, the quota inodes are stored in two additional superblock
fields. Some changes introduced by this patch that should be pointed
out are:
1) Two new ext4-superblock fields - s_usr_quota_inum and
s_grp_quota_inum for storing the quota inodes in use.
2) Default quota inodes are: inode#3 for tracking userquota and inode#4
for tracking group quota. The superblock fields can be set to use
other inodes as well.
3) If the QUOTA feature and corresponding quota inodes are set in
superblock, the quota usage tracking is turned on at mount time. On
'quotaon' ioctl, the quota limits enforcement is turned
on. 'quotaoff' ioctl turns off only the limits enforcement in this
case.
4) When QUOTA feature is in use, the quota mount options 'quota',
'usrquota', 'grpquota' are ignored by the kernel.
5) mke2fs or tune2fs can be used to set the QUOTA feature and initialize
quota inodes. The default reserved inodes will not be visible to user
as regular files.
6) The quota-tools will need to be modified to support hidden quota
files on ext4. E2fsprogs will also include support for creating and
fixing quota files.
7) Support is only for the new V2 quota file format.
Tested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johann Lombardi <johann@whamcloud.com>
Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-07-23 04:21:31 +04:00
static int ext4_quota_enable ( struct super_block * sb , int type , int format_id ,
unsigned int flags ) ;
2006-10-11 12:20:50 +04:00
2014-09-29 16:58:25 +04:00
static struct dquot * * ext4_get_dquots ( struct inode * inode )
{
return EXT4_I ( inode ) - > i_dquot ;
}
2009-09-22 04:01:08 +04:00
static const struct dquot_operations ext4_quota_operations = {
2017-06-22 18:46:48 +03:00
. get_reserved_space = ext4_get_reserved_space ,
. write_dquot = ext4_write_dquot ,
. acquire_dquot = ext4_acquire_dquot ,
. release_dquot = ext4_release_dquot ,
. mark_dirty = ext4_mark_dquot_dirty ,
. write_info = ext4_write_info ,
. alloc_dquot = dquot_alloc ,
. destroy_dquot = dquot_destroy ,
. get_projid = ext4_get_projid ,
. get_inode_usage = ext4_get_inode_usage ,
2019-10-06 13:30:28 +03:00
. get_next_id = dquot_get_next_id ,
2006-10-11 12:20:50 +04:00
} ;
2009-09-22 04:01:09 +04:00
static const struct quotactl_ops ext4_qctl_operations = {
2006-10-11 12:20:53 +04:00
. quota_on = ext4_quota_on ,
2010-08-02 01:48:36 +04:00
. quota_off = ext4_quota_off ,
2010-05-19 15:16:45 +04:00
. quota_sync = dquot_quota_sync ,
2014-11-19 02:42:09 +03:00
. get_state = dquot_get_state ,
2010-05-19 15:16:45 +04:00
. set_info = dquot_set_dqinfo ,
. get_dqblk = dquot_get_dqblk ,
2016-02-19 21:19:01 +03:00
. set_dqblk = dquot_set_dqblk ,
. get_nextdqblk = dquot_get_next_dqblk ,
2006-10-11 12:20:50 +04:00
} ;
# endif
2007-02-12 11:55:41 +03:00
static const struct super_operations ext4_sops = {
2006-10-11 12:20:53 +04:00
. alloc_inode = ext4_alloc_inode ,
2019-04-16 02:28:34 +03:00
. free_inode = ext4_free_in_core_inode ,
2006-10-11 12:20:53 +04:00
. destroy_inode = ext4_destroy_inode ,
. write_inode = ext4_write_inode ,
. dirty_inode = ext4_dirty_inode ,
2010-11-08 21:51:33 +03:00
. drop_inode = ext4_drop_inode ,
2010-06-07 21:16:22 +04:00
. evict_inode = ext4_evict_inode ,
2006-10-11 12:20:53 +04:00
. put_super = ext4_put_super ,
. sync_fs = ext4_sync_fs ,
2009-01-10 03:40:58 +03:00
. freeze_fs = ext4_freeze ,
. unfreeze_fs = ext4_unfreeze ,
2006-10-11 12:20:53 +04:00
. statfs = ext4_statfs ,
. show_options = ext4_show_options ,
2006-10-11 12:20:50 +04:00
# ifdef CONFIG_QUOTA
2006-10-11 12:20:53 +04:00
. quota_read = ext4_quota_read ,
. quota_write = ext4_quota_write ,
2014-09-29 16:58:25 +04:00
. get_dquots = ext4_get_dquots ,
2006-10-11 12:20:50 +04:00
# endif
} ;
2007-10-22 03:42:17 +04:00
static const struct export_operations ext4_export_ops = {
2007-10-22 03:42:08 +04:00
. fh_to_dentry = ext4_fh_to_dentry ,
. fh_to_parent = ext4_fh_to_parent ,
2006-10-11 12:20:53 +04:00
. get_parent = ext4_get_parent ,
2018-12-19 22:07:58 +03:00
. commit_metadata = ext4_nfs_commit_metadata ,
2006-10-11 12:20:50 +04:00
} ;
enum {
Opt_bsd_df , Opt_minix_df , Opt_grpid , Opt_nogrpid ,
2021-10-27 17:18:57 +03:00
Opt_resgid , Opt_resuid , Opt_sb ,
2012-03-04 03:04:40 +04:00
Opt_nouid32 , Opt_debug , Opt_removed ,
2006-10-11 12:20:50 +04:00
Opt_user_xattr , Opt_nouser_xattr , Opt_acl , Opt_noacl ,
2012-03-04 03:04:40 +04:00
Opt_auto_da_alloc , Opt_noauto_da_alloc , Opt_noload ,
ext4: allow specifying external journal by pathname mount option
It's always been a hassle that if an external journal's
device number changes, the filesystem won't mount.
And since boot-time enumeration can change, device number
changes aren't unusual.
The current mechanism to update the journal location is by
passing in a mount option w/ a new devnum, but that's a hassle;
it's a manual approach, fixing things after the fact.
Adding a mount option, "-o journal_path=/dev/$DEVICE" would
help, since then we can do, e.g.,
# mount -o journal_path=/dev/disk/by-label/$JOURNAL_LABEL ...
and it'll mount even if the devnum has changed, as shown here:
# losetup /dev/loop0 journalfile
# mke2fs -L mylabel-journal -O journal_dev /dev/loop0
# mkfs.ext4 -L mylabel -J device=/dev/loop0 /dev/sdb1
Change the journal device number:
# losetup -d /dev/loop0
# losetup /dev/loop1 journalfile
And today it will fail:
# mount /dev/sdb1 /mnt/test
mount: wrong fs type, bad option, bad superblock on /dev/sdb1,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so
# dmesg | tail -n 1
[17343.240702] EXT4-fs (sdb1): error: couldn't read superblock of external journal
But with this new mount option, we can specify the new path:
# mount -o journal_path=/dev/loop1 /dev/sdb1 /mnt/test
#
(which does update the encoded device number, incidentally):
# umount /dev/sdb1
# dumpe2fs -h /dev/sdb1 | grep "Journal device"
dumpe2fs 1.41.12 (17-May-2010)
Journal device: 0x0701
But best of all we can just always mount by journal-path, and
it'll always work:
# mount -o journal_path=/dev/disk/by-label/mylabel-journal /dev/sdb1 /mnt/test
#
So the journal_path option can be specified in fstab, and as long as
the disk is available somewhere, and findable by label (or by UUID),
we can mount.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2013-08-29 03:05:07 +04:00
Opt_commit , Opt_min_batch_time , Opt_max_batch_time , Opt_journal_dev ,
Opt_journal_path , Opt_journal_checksum , Opt_journal_async_commit ,
2006-10-11 12:20:50 +04:00
Opt_abort , Opt_data_journal , Opt_data_ordered , Opt_data_writeback ,
2015-04-16 08:56:00 +03:00
Opt_data_err_abort , Opt_data_err_ignore , Opt_test_dummy_encryption ,
2020-07-02 04:56:07 +03:00
Opt_inlinecrypt ,
2021-10-27 17:18:57 +03:00
Opt_usrjquota , Opt_grpjquota , Opt_quota ,
2012-03-02 21:14:24 +04:00
Opt_noquota , Opt_barrier , Opt_nobarrier , Opt_err ,
2020-05-28 18:00:00 +03:00
Opt_usrquota , Opt_grpquota , Opt_prjquota , Opt_i_version ,
Opt_dax , Opt_dax_always , Opt_dax_inode , Opt_dax_never ,
2018-06-13 06:34:57 +03:00
Opt_stripe , Opt_delalloc , Opt_nodelalloc , Opt_warn_on_error ,
2021-12-22 13:45:16 +03:00
Opt_nowarn_on_error , Opt_mblk_io_submit , Opt_debug_want_extra_isize ,
ext4: Turn off multiple page-io submission by default
Jon Nelson has found a test case which causes postgresql to fail with
the error:
psql:t.sql:4: ERROR: invalid page header in block 38269 of relation base/16384/16581
Under memory pressure, it looks like part of a file can end up getting
replaced by zero's. Until we can figure out the cause, we'll roll
back the change and use block_write_full_page() instead of
ext4_bio_write_page(). The new, more efficient writing function can
be used via the mount option mblk_io_submit, so we can test and fix
the new page I/O code.
To reproduce the problem, install postgres 8.4 or 9.0, and pin enough
memory that the system is just at the point of triggering writeback
before running the following sql script:
begin;
create temporary table foo as select x as a, ARRAY[x] as b FROM
generate_series(1, 10000000 ) AS x;
create index foo_a_idx on foo (a);
create index foo_b_idx on foo USING GIN (b);
rollback;
If the temporary table is created on a hard drive partition which is
encrypted using dm_crypt, then under memory pressure, approximately
30-40% of the time, pgsql will issue the above failure.
This patch should fix this problem, and the problem will come back if
the file system is mounted with the mblk_io_submit mount option.
Reported-by: Jon Nelson <jnelson@jamponi.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-14 23:27:50 +03:00
Opt_nomblk_io_submit , Opt_block_validity , Opt_noblock_validity ,
2009-11-19 22:25:42 +03:00
Opt_inode_readahead_blks , Opt_journal_ioprio ,
2010-03-05 00:14:02 +03:00
Opt_dioread_nolock , Opt_dioread_lock ,
2011-12-13 07:06:18 +04:00
Opt_discard , Opt_nodiscard , Opt_init_itable , Opt_noinit_itable ,
2017-06-22 18:55:14 +03:00
Opt_max_dir_size_kb , Opt_nojournal_checksum , Opt_nombcache ,
2021-04-01 20:21:29 +03:00
Opt_no_prefetch_block_bitmaps , Opt_mb_optimize_scan ,
2021-10-27 17:18:46 +03:00
Opt_errors , Opt_data , Opt_data_err , Opt_jqfmt , Opt_dax_type ,
2020-10-15 23:37:59 +03:00
# ifdef CONFIG_EXT4_DEBUG
2020-11-06 06:59:11 +03:00
Opt_fc_debug_max_replay , Opt_fc_debug_force
2020-10-15 23:37:59 +03:00
# endif
2006-10-11 12:20:50 +04:00
} ;
2021-10-27 17:18:46 +03:00
static const struct constant_table ext4_param_errors [ ] = {
2021-10-27 17:18:57 +03:00
{ " continue " , EXT4_MOUNT_ERRORS_CONT } ,
{ " panic " , EXT4_MOUNT_ERRORS_PANIC } ,
{ " remount-ro " , EXT4_MOUNT_ERRORS_RO } ,
2021-10-27 17:18:46 +03:00
{ }
} ;
static const struct constant_table ext4_param_data [ ] = {
2021-10-27 17:18:57 +03:00
{ " journal " , EXT4_MOUNT_JOURNAL_DATA } ,
{ " ordered " , EXT4_MOUNT_ORDERED_DATA } ,
{ " writeback " , EXT4_MOUNT_WRITEBACK_DATA } ,
2021-10-27 17:18:46 +03:00
{ }
} ;
static const struct constant_table ext4_param_data_err [ ] = {
{ " abort " , Opt_data_err_abort } ,
{ " ignore " , Opt_data_err_ignore } ,
{ }
} ;
static const struct constant_table ext4_param_jqfmt [ ] = {
2021-10-27 17:18:57 +03:00
{ " vfsold " , QFMT_VFS_OLD } ,
{ " vfsv0 " , QFMT_VFS_V0 } ,
{ " vfsv1 " , QFMT_VFS_V1 } ,
2021-10-27 17:18:46 +03:00
{ }
} ;
static const struct constant_table ext4_param_dax [ ] = {
{ " always " , Opt_dax_always } ,
{ " inode " , Opt_dax_inode } ,
{ " never " , Opt_dax_never } ,
{ }
} ;
/* String parameter that allows empty argument */
# define fsparam_string_empty(NAME, OPT) \
__fsparam ( fs_param_is_string , NAME , OPT , fs_param_can_be_empty , NULL )
/*
* Mount option specification
* We don ' t use fsparam_flag_no because of the way we set the
* options and the way we show them in _ext4_show_options ( ) . To
* keep the changes to a minimum , let ' s keep the negative options
* separate for now .
*/
static const struct fs_parameter_spec ext4_param_specs [ ] = {
fsparam_flag ( " bsddf " , Opt_bsd_df ) ,
fsparam_flag ( " minixdf " , Opt_minix_df ) ,
fsparam_flag ( " grpid " , Opt_grpid ) ,
fsparam_flag ( " bsdgroups " , Opt_grpid ) ,
fsparam_flag ( " nogrpid " , Opt_nogrpid ) ,
fsparam_flag ( " sysvgroups " , Opt_nogrpid ) ,
fsparam_u32 ( " resgid " , Opt_resgid ) ,
fsparam_u32 ( " resuid " , Opt_resuid ) ,
fsparam_u32 ( " sb " , Opt_sb ) ,
fsparam_enum ( " errors " , Opt_errors , ext4_param_errors ) ,
fsparam_flag ( " nouid32 " , Opt_nouid32 ) ,
fsparam_flag ( " debug " , Opt_debug ) ,
fsparam_flag ( " oldalloc " , Opt_removed ) ,
fsparam_flag ( " orlov " , Opt_removed ) ,
fsparam_flag ( " user_xattr " , Opt_user_xattr ) ,
fsparam_flag ( " nouser_xattr " , Opt_nouser_xattr ) ,
fsparam_flag ( " acl " , Opt_acl ) ,
fsparam_flag ( " noacl " , Opt_noacl ) ,
fsparam_flag ( " norecovery " , Opt_noload ) ,
fsparam_flag ( " noload " , Opt_noload ) ,
fsparam_flag ( " bh " , Opt_removed ) ,
fsparam_flag ( " nobh " , Opt_removed ) ,
fsparam_u32 ( " commit " , Opt_commit ) ,
fsparam_u32 ( " min_batch_time " , Opt_min_batch_time ) ,
fsparam_u32 ( " max_batch_time " , Opt_max_batch_time ) ,
fsparam_u32 ( " journal_dev " , Opt_journal_dev ) ,
fsparam_bdev ( " journal_path " , Opt_journal_path ) ,
fsparam_flag ( " journal_checksum " , Opt_journal_checksum ) ,
fsparam_flag ( " nojournal_checksum " , Opt_nojournal_checksum ) ,
fsparam_flag ( " journal_async_commit " , Opt_journal_async_commit ) ,
fsparam_flag ( " abort " , Opt_abort ) ,
fsparam_enum ( " data " , Opt_data , ext4_param_data ) ,
fsparam_enum ( " data_err " , Opt_data_err ,
ext4_param_data_err ) ,
fsparam_string_empty
( " usrjquota " , Opt_usrjquota ) ,
fsparam_string_empty
( " grpjquota " , Opt_grpjquota ) ,
fsparam_enum ( " jqfmt " , Opt_jqfmt , ext4_param_jqfmt ) ,
fsparam_flag ( " grpquota " , Opt_grpquota ) ,
fsparam_flag ( " quota " , Opt_quota ) ,
fsparam_flag ( " noquota " , Opt_noquota ) ,
fsparam_flag ( " usrquota " , Opt_usrquota ) ,
fsparam_flag ( " prjquota " , Opt_prjquota ) ,
fsparam_flag ( " barrier " , Opt_barrier ) ,
fsparam_u32 ( " barrier " , Opt_barrier ) ,
fsparam_flag ( " nobarrier " , Opt_nobarrier ) ,
fsparam_flag ( " i_version " , Opt_i_version ) ,
fsparam_flag ( " dax " , Opt_dax ) ,
fsparam_enum ( " dax " , Opt_dax_type , ext4_param_dax ) ,
fsparam_u32 ( " stripe " , Opt_stripe ) ,
fsparam_flag ( " delalloc " , Opt_delalloc ) ,
fsparam_flag ( " nodelalloc " , Opt_nodelalloc ) ,
fsparam_flag ( " warn_on_error " , Opt_warn_on_error ) ,
fsparam_flag ( " nowarn_on_error " , Opt_nowarn_on_error ) ,
fsparam_u32 ( " debug_want_extra_isize " ,
Opt_debug_want_extra_isize ) ,
fsparam_flag ( " mblk_io_submit " , Opt_removed ) ,
fsparam_flag ( " nomblk_io_submit " , Opt_removed ) ,
fsparam_flag ( " block_validity " , Opt_block_validity ) ,
fsparam_flag ( " noblock_validity " , Opt_noblock_validity ) ,
fsparam_u32 ( " inode_readahead_blks " ,
Opt_inode_readahead_blks ) ,
fsparam_u32 ( " journal_ioprio " , Opt_journal_ioprio ) ,
fsparam_u32 ( " auto_da_alloc " , Opt_auto_da_alloc ) ,
fsparam_flag ( " auto_da_alloc " , Opt_auto_da_alloc ) ,
fsparam_flag ( " noauto_da_alloc " , Opt_noauto_da_alloc ) ,
fsparam_flag ( " dioread_nolock " , Opt_dioread_nolock ) ,
fsparam_flag ( " nodioread_nolock " , Opt_dioread_lock ) ,
fsparam_flag ( " dioread_lock " , Opt_dioread_lock ) ,
fsparam_flag ( " discard " , Opt_discard ) ,
fsparam_flag ( " nodiscard " , Opt_nodiscard ) ,
fsparam_u32 ( " init_itable " , Opt_init_itable ) ,
fsparam_flag ( " init_itable " , Opt_init_itable ) ,
fsparam_flag ( " noinit_itable " , Opt_noinit_itable ) ,
# ifdef CONFIG_EXT4_DEBUG
fsparam_flag ( " fc_debug_force " , Opt_fc_debug_force ) ,
fsparam_u32 ( " fc_debug_max_replay " , Opt_fc_debug_max_replay ) ,
# endif
fsparam_u32 ( " max_dir_size_kb " , Opt_max_dir_size_kb ) ,
fsparam_flag ( " test_dummy_encryption " ,
Opt_test_dummy_encryption ) ,
fsparam_string ( " test_dummy_encryption " ,
Opt_test_dummy_encryption ) ,
fsparam_flag ( " inlinecrypt " , Opt_inlinecrypt ) ,
fsparam_flag ( " nombcache " , Opt_nombcache ) ,
fsparam_flag ( " no_mbcache " , Opt_nombcache ) , /* for backward compatibility */
fsparam_flag ( " prefetch_block_bitmaps " ,
Opt_removed ) ,
fsparam_flag ( " no_prefetch_block_bitmaps " ,
Opt_no_prefetch_block_bitmaps ) ,
fsparam_s32 ( " mb_optimize_scan " , Opt_mb_optimize_scan ) ,
fsparam_string ( " check " , Opt_removed ) , /* mount option from ext2/3 */
fsparam_flag ( " nocheck " , Opt_removed ) , /* mount option from ext2/3 */
fsparam_flag ( " reservation " , Opt_removed ) , /* mount option from ext2/3 */
fsparam_flag ( " noreservation " , Opt_removed ) , /* mount option from ext2/3 */
fsparam_u32 ( " journal " , Opt_removed ) , /* mount option from ext2/3 */
{ }
} ;
2009-01-06 06:46:26 +03:00
# define DEFAULT_JOURNAL_IOPRIO (IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 3))
ext4: improve cr 0 / cr 1 group scanning
Instead of traversing through groups linearly, scan groups in specific
orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
largest free order >= the order of the request. So, with this patch,
we maintain lists for each possible order and insert each group into a
list based on the largest free order in its buddy bitmap. During cr 0
allocation, we traverse these lists in the increasing order of largest
free orders. This allows us to find a group with the best available cr
0 match in constant time. If nothing can be found, we fall back to cr 1
immediately.
At cr 1, the story is slightly different. We want to traverse in the
order of increasing average fragment size. For cr 1, we maintain an rb
tree of groupinfos which is sorted by average fragment size. Instead
of traversing linearly, at cr 1, we traverse in the order of increasing
average fragment size, starting at the most suitable group. This brings
down cr 1 search complexity to log(num groups).
For cr >= 2, we just perform the linear search as before. Also, in
case of lock contention, we intermittently fall back to linear search
even in the cr 0 and cr 1 cases. This allows us to proceed during the
allocation path even in case of high contention.
There is an opportunity to do optimization at cr 2 too. That's because
at cr 2 we only consider groups where the bb_free counter (number of free
blocks) is greater than the request extent size. That's left as future
work.
All the changes introduced in this patch are protected under a new
mount option "mb_optimize_scan".
With this patchset, the following experiment was performed:
A highly fragmented disk of size 65TB was created. The disk had no
contiguous 2M regions. The following command was run 3 times
consecutively:
time dd if=/dev/urandom of=file bs=2M count=10
Here are the results with and without cr 0/1 optimizations introduced
in this patch:
|---------+------------------------------+---------------------------|
| | Without CR 0/1 Optimizations | With CR 0/1 Optimizations |
|---------+------------------------------+---------------------------|
| 1st run | 5m1.871s | 2m47.642s |
| 2nd run | 2m28.390s | 0m0.611s |
| 3rd run | 2m26.530s | 0m1.255s |
|---------+------------------------------+---------------------------|
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20210401172129.189766-6-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
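The cr 0 lookup described in the commit message above can be sketched as follows. This is a minimal illustration under a simplified in-memory model, not the kernel's mballoc code: the names `SKETCH_MAX_ORDER`, `group_info`, `order_lists`, `cr0_insert`, and `cr0_find` are all hypothetical, and locking, list removal, and the cr 1 rb tree are left out.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_ORDER 14	/* hypothetical upper bound on buddy orders */

/* Hypothetical stand-in for a per-group descriptor. */
struct group_info {
	int group_no;
	int largest_free_order;		/* largest free order in the buddy bitmap */
	struct group_info *next;	/* next group with the same largest order */
};

/* One singly linked list per possible largest free order. */
struct order_lists {
	struct group_info *head[SKETCH_MAX_ORDER + 1];
};

/* File a group under the list matching its largest free order. */
static void cr0_insert(struct order_lists *ol, struct group_info *gi)
{
	assert(gi->largest_free_order >= 0 &&
	       gi->largest_free_order <= SKETCH_MAX_ORDER);
	gi->next = ol->head[gi->largest_free_order];
	ol->head[gi->largest_free_order] = gi;
}

/*
 * Walk the lists in increasing order, starting at the request order, and
 * return the first group found. NULL means nothing fits, so the caller
 * would fall back to cr 1.
 */
static struct group_info *cr0_find(const struct order_lists *ol, int req_order)
{
	int order;

	if (req_order < 0)
		req_order = 0;
	for (order = req_order; order <= SKETCH_MAX_ORDER; order++)
		if (ol->head[order])
			return ol->head[order];
	return NULL;
}
```

In this sketch the lookup cost is bounded by the number of orders rather than the number of groups, which is the sense in which the search is constant time.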
2021-04-01 20:21:27 +03:00
2017-04-30 06:47:50 +03:00
static const char deprecated_msg [ ] =
" Mount option \" %s \" will be removed by %s \n "
2010-03-02 06:29:21 +03:00
" Contact linux-ext4@vger.kernel.org if you think we should keep it. \n " ;
2009-01-06 06:46:26 +03:00
2012-03-04 08:20:47 +04:00
# define MOPT_SET 0x0001
# define MOPT_CLEAR 0x0002
# define MOPT_NOSUPPORT 0x0004
# define MOPT_EXPLICIT 0x0008
2006-10-11 12:20:50 +04:00
# ifdef CONFIG_QUOTA
2012-03-04 08:20:47 +04:00
# define MOPT_Q 0
2021-10-27 17:18:57 +03:00
# define MOPT_QFMT 0x0010
2012-03-04 08:20:47 +04:00
# else
# define MOPT_Q MOPT_NOSUPPORT
# define MOPT_QFMT MOPT_NOSUPPORT
2006-10-11 12:20:50 +04:00
# endif
2021-10-27 17:18:57 +03:00
# define MOPT_NO_EXT2 0x0020
# define MOPT_NO_EXT3 0x0040
2013-02-03 08:38:39 +04:00
# define MOPT_EXT4_ONLY (MOPT_NO_EXT2 | MOPT_NO_EXT3)
2021-10-27 17:18:57 +03:00
# define MOPT_SKIP 0x0080
# define MOPT_2 0x0100
2012-03-04 08:20:47 +04:00
static const struct mount_opts {
int token ;
int mount_opt ;
int flags ;
} ext4_mount_opts [ ] = {
{ Opt_minix_df , EXT4_MOUNT_MINIX_DF , MOPT_SET } ,
{ Opt_bsd_df , EXT4_MOUNT_MINIX_DF , MOPT_CLEAR } ,
{ Opt_grpid , EXT4_MOUNT_GRPID , MOPT_SET } ,
{ Opt_nogrpid , EXT4_MOUNT_GRPID , MOPT_CLEAR } ,
{ Opt_block_validity , EXT4_MOUNT_BLOCK_VALIDITY , MOPT_SET } ,
{ Opt_noblock_validity , EXT4_MOUNT_BLOCK_VALIDITY , MOPT_CLEAR } ,
2013-02-03 08:38:39 +04:00
{ Opt_dioread_nolock , EXT4_MOUNT_DIOREAD_NOLOCK ,
MOPT_EXT4_ONLY | MOPT_SET } ,
{ Opt_dioread_lock , EXT4_MOUNT_DIOREAD_NOLOCK ,
MOPT_EXT4_ONLY | MOPT_CLEAR } ,
2012-03-04 08:20:47 +04:00
{ Opt_discard , EXT4_MOUNT_DISCARD , MOPT_SET } ,
{ Opt_nodiscard , EXT4_MOUNT_DISCARD , MOPT_CLEAR } ,
2013-02-03 08:38:39 +04:00
{ Opt_delalloc , EXT4_MOUNT_DELALLOC ,
MOPT_EXT4_ONLY | MOPT_SET | MOPT_EXPLICIT } ,
{ Opt_nodelalloc , EXT4_MOUNT_DELALLOC ,
2013-08-09 07:01:24 +04:00
MOPT_EXT4_ONLY | MOPT_CLEAR } ,
2018-06-13 06:34:57 +03:00
{ Opt_warn_on_error , EXT4_MOUNT_WARN_ON_ERROR , MOPT_SET } ,
{ Opt_nowarn_on_error , EXT4_MOUNT_WARN_ON_ERROR , MOPT_CLEAR } ,
2022-05-10 21:32:32 +03:00
{ Opt_commit , 0 , MOPT_NO_EXT2 } ,
2014-11-26 00:20:50 +03:00
{ Opt_nojournal_checksum , EXT4_MOUNT_JOURNAL_CHECKSUM ,
MOPT_EXT4_ONLY | MOPT_CLEAR } ,
2013-02-03 08:38:39 +04:00
{ Opt_journal_checksum , EXT4_MOUNT_JOURNAL_CHECKSUM ,
ext4: do not allow journal_opts for fs w/o journal
It appears that we can pass journal-related mount options and that such
options are shown in /proc/mounts
Example:
#mkfs.ext4 -F /dev/vdb
#tune2fs -O ^has_journal /dev/vdb
#mount /dev/vdb /mnt/ -ocommit=20,journal_async_commit
#cat /proc/mounts | grep /mnt
/dev/vdb /mnt ext4 rw,relatime,journal_checksum,journal_async_commit,commit=20,data=ordered 0 0
But the options "journal_checksum,journal_async_commit,commit=20,data=ordered" have
nothing to do with reality because there is no journal at all.
This patch disallows the following options for journalless configurations:
- journal_checksum
- journal_async_commit
- commit=%ld
- data={writeback,ordered,journal}
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2015-10-19 06:50:26 +03:00
MOPT_EXT4_ONLY | MOPT_SET | MOPT_EXPLICIT } ,
2012-03-04 08:20:47 +04:00
{ Opt_journal_async_commit , ( EXT4_MOUNT_JOURNAL_ASYNC_COMMIT |
2013-02-03 08:38:39 +04:00
EXT4_MOUNT_JOURNAL_CHECKSUM ) ,
2015-10-19 06:50:26 +03:00
MOPT_EXT4_ONLY | MOPT_SET | MOPT_EXPLICIT } ,
2013-02-03 08:38:39 +04:00
{ Opt_noload , EXT4_MOUNT_NOLOAD , MOPT_NO_EXT2 | MOPT_SET } ,
2021-10-27 17:18:57 +03:00
{ Opt_data_err , EXT4_MOUNT_DATA_ERR_ABORT , MOPT_NO_EXT2 } ,
2012-03-04 08:20:47 +04:00
{ Opt_barrier , EXT4_MOUNT_BARRIER , MOPT_SET } ,
{ Opt_nobarrier , EXT4_MOUNT_BARRIER , MOPT_CLEAR } ,
{ Opt_noauto_da_alloc , EXT4_MOUNT_NO_AUTO_DA_ALLOC , MOPT_SET } ,
{ Opt_auto_da_alloc , EXT4_MOUNT_NO_AUTO_DA_ALLOC , MOPT_CLEAR } ,
{ Opt_noinit_itable , EXT4_MOUNT_INIT_INODE_TABLE , MOPT_CLEAR } ,
2021-10-27 17:18:57 +03:00
{ Opt_dax_type , 0 , MOPT_EXT4_ONLY } ,
{ Opt_journal_dev , 0 , MOPT_NO_EXT2 } ,
{ Opt_journal_path , 0 , MOPT_NO_EXT2 } ,
{ Opt_journal_ioprio , 0 , MOPT_NO_EXT2 } ,
{ Opt_data , 0 , MOPT_NO_EXT2 } ,
2012-03-04 08:20:47 +04:00
{ Opt_user_xattr , EXT4_MOUNT_XATTR_USER , MOPT_SET } ,
{ Opt_nouser_xattr , EXT4_MOUNT_XATTR_USER , MOPT_CLEAR } ,
2008-10-11 04:02:48 +04:00
# ifdef CONFIG_EXT4_FS_POSIX_ACL
2012-03-04 08:20:47 +04:00
{ Opt_acl , EXT4_MOUNT_POSIX_ACL , MOPT_SET } ,
{ Opt_noacl , EXT4_MOUNT_POSIX_ACL , MOPT_CLEAR } ,
2006-10-11 12:20:50 +04:00
# else
2012-03-04 08:20:47 +04:00
{ Opt_acl , 0 , MOPT_NOSUPPORT } ,
{ Opt_noacl , 0 , MOPT_NOSUPPORT } ,
2006-10-11 12:20:50 +04:00
# endif
2012-03-04 08:20:47 +04:00
{ Opt_nouid32 , EXT4_MOUNT_NO_UID32 , MOPT_SET } ,
{ Opt_debug , EXT4_MOUNT_DEBUG , MOPT_SET } ,
{ Opt_quota , EXT4_MOUNT_QUOTA | EXT4_MOUNT_USRQUOTA , MOPT_SET | MOPT_Q } ,
{ Opt_usrquota , EXT4_MOUNT_QUOTA | EXT4_MOUNT_USRQUOTA ,
MOPT_SET | MOPT_Q } ,
{ Opt_grpquota , EXT4_MOUNT_QUOTA | EXT4_MOUNT_GRPQUOTA ,
MOPT_SET | MOPT_Q } ,
2016-09-06 06:08:16 +03:00
{ Opt_prjquota , EXT4_MOUNT_QUOTA | EXT4_MOUNT_PRJQUOTA ,
MOPT_SET | MOPT_Q } ,
2012-03-04 08:20:47 +04:00
{ Opt_noquota , ( EXT4_MOUNT_QUOTA | EXT4_MOUNT_USRQUOTA |
2016-09-06 06:08:16 +03:00
EXT4_MOUNT_GRPQUOTA | EXT4_MOUNT_PRJQUOTA ) ,
MOPT_CLEAR | MOPT_Q } ,
2021-10-27 17:18:57 +03:00
{ Opt_usrjquota , 0 , MOPT_Q } ,
{ Opt_grpjquota , 0 , MOPT_Q } ,
{ Opt_jqfmt , 0 , MOPT_QFMT } ,
2017-06-22 18:55:14 +03:00
{ Opt_nombcache , EXT4_MOUNT_NO_MBCACHE , MOPT_SET } ,
2021-04-01 20:21:29 +03:00
{ Opt_no_prefetch_block_bitmaps , EXT4_MOUNT_NO_PREFETCH_BLOCK_BITMAPS ,
2020-07-17 07:14:40 +03:00
MOPT_SET } ,
2020-11-06 06:59:11 +03:00
# ifdef CONFIG_EXT4_DEBUG
2020-10-15 23:38:00 +03:00
{ Opt_fc_debug_force , EXT4_MOUNT2_JOURNAL_FAST_COMMIT ,
MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY } ,
2020-10-15 23:37:59 +03:00
# endif
2012-03-04 08:20:47 +04:00
{ Opt_err , 0 , 0 }
} ;
2022-01-18 09:56:14 +03:00
# if IS_ENABLED(CONFIG_UNICODE)
2019-04-25 21:05:42 +03:00
static const struct ext4_sb_encodings {
__u16 magic ;
char * name ;
2021-09-15 10:00:00 +03:00
unsigned int version ;
2019-04-25 21:05:42 +03:00
} ext4_sb_encoding_map [ ] = {
2021-09-15 10:00:00 +03:00
{ EXT4_ENC_UTF8_12_1 , " utf8 " , UNICODE_AGE ( 12 , 1 , 0 ) } ,
2019-04-25 21:05:42 +03:00
} ;
2021-09-15 09:59:56 +03:00
static const struct ext4_sb_encodings *
ext4_sb_read_encoding ( const struct ext4_super_block * es )
2019-04-25 21:05:42 +03:00
{
__u16 magic = le16_to_cpu ( es - > s_encoding ) ;
int i ;
for ( i = 0 ; i < ARRAY_SIZE ( ext4_sb_encoding_map ) ; i + + )
if ( magic = = ext4_sb_encoding_map [ i ] . magic )
2021-09-15 09:59:56 +03:00
return & ext4_sb_encoding_map [ i ] ;
2019-04-25 21:05:42 +03:00
2021-09-15 09:59:56 +03:00
return NULL ;
2019-04-25 21:05:42 +03:00
}
# endif
2021-10-27 17:18:52 +03:00
# define EXT4_SPEC_JQUOTA (1 << 0)
# define EXT4_SPEC_JQFMT (1 << 1)
# define EXT4_SPEC_DATAJ (1 << 2)
# define EXT4_SPEC_SB_BLOCK (1 << 3)
# define EXT4_SPEC_JOURNAL_DEV (1 << 4)
# define EXT4_SPEC_JOURNAL_IOPRIO (1 << 5)
# define EXT4_SPEC_s_want_extra_isize (1 << 7)
# define EXT4_SPEC_s_max_batch_time (1 << 8)
# define EXT4_SPEC_s_min_batch_time (1 << 9)
# define EXT4_SPEC_s_inode_readahead_blks (1 << 10)
# define EXT4_SPEC_s_li_wait_mult (1 << 11)
# define EXT4_SPEC_s_max_dir_size_kb (1 << 12)
# define EXT4_SPEC_s_stripe (1 << 13)
# define EXT4_SPEC_s_resuid (1 << 14)
# define EXT4_SPEC_s_resgid (1 << 15)
# define EXT4_SPEC_s_commit_interval (1 << 16)
# define EXT4_SPEC_s_fc_debug_max_replay (1 << 17)
2021-10-27 17:18:53 +03:00
# define EXT4_SPEC_s_sb_block (1 << 18)
2022-03-08 12:52:00 +03:00
# define EXT4_SPEC_mb_optimize_scan (1 << 19)
2021-10-27 17:18:52 +03:00
2021-10-27 17:18:48 +03:00
struct ext4_fs_context {
2021-10-27 17:18:50 +03:00
char * s_qf_names [ EXT4_MAXQUOTAS ] ;
2022-05-26 07:04:12 +03:00
struct fscrypt_dummy_policy dummy_enc_policy ;
2021-10-27 17:18:50 +03:00
int s_jquota_fmt ; /* Format of quota to use */
2021-10-27 17:18:52 +03:00
# ifdef CONFIG_EXT4_DEBUG
int s_fc_debug_max_replay ;
# endif
2021-10-27 17:18:50 +03:00
unsigned short qname_spec ;
2021-10-27 17:18:52 +03:00
unsigned long vals_s_flags ; /* Bits to set in s_flags */
unsigned long mask_s_flags ; /* Bits changed in s_flags */
2021-10-27 17:18:50 +03:00
unsigned long journal_devnum ;
2021-10-27 17:18:52 +03:00
unsigned long s_commit_interval ;
unsigned long s_stripe ;
unsigned int s_inode_readahead_blks ;
unsigned int s_want_extra_isize ;
unsigned int s_li_wait_mult ;
unsigned int s_max_dir_size_kb ;
2021-10-27 17:18:50 +03:00
unsigned int journal_ioprio ;
2021-10-27 17:18:52 +03:00
unsigned int vals_s_mount_opt ;
unsigned int mask_s_mount_opt ;
unsigned int vals_s_mount_opt2 ;
unsigned int mask_s_mount_opt2 ;
2022-02-01 16:13:45 +03:00
unsigned long vals_s_mount_flags ;
unsigned long mask_s_mount_flags ;
2021-10-27 17:18:51 +03:00
unsigned int opt_flags ; /* MOPT flags */
2021-10-27 17:18:52 +03:00
unsigned int spec ;
u32 s_max_batch_time ;
u32 s_min_batch_time ;
kuid_t s_resuid ;
kgid_t s_resgid ;
2021-10-27 17:18:53 +03:00
ext4_fsblk_t s_sb_block ;
2021-04-01 20:21:24 +03:00
} ;
2021-10-27 17:18:56 +03:00
static void ext4_fc_free ( struct fs_context * fc )
{
struct ext4_fs_context * ctx = fc - > fs_private ;
int i ;
if ( ! ctx )
return ;
for ( i = 0 ; i < EXT4_MAXQUOTAS ; i + + )
kfree ( ctx - > s_qf_names [ i ] ) ;
2022-05-26 07:04:12 +03:00
fscrypt_free_dummy_policy ( & ctx - > dummy_enc_policy ) ;
2021-10-27 17:18:56 +03:00
kfree ( ctx ) ;
}
int ext4_init_fs_context ( struct fs_context * fc )
{
2021-12-15 14:43:09 +03:00
struct ext4_fs_context * ctx ;
2021-10-27 17:18:56 +03:00
ctx = kzalloc ( sizeof ( struct ext4_fs_context ) , GFP_KERNEL ) ;
if ( ! ctx )
return - ENOMEM ;
fc - > fs_private = ctx ;
fc - > ops = & ext4_context_ops ;
return 0 ;
}
2021-10-27 17:18:50 +03:00
# ifdef CONFIG_QUOTA
/*
* Note the name of the specified quota file .
*/
static int note_qf_name ( struct fs_context * fc , int qtype ,
struct fs_parameter * param )
{
struct ext4_fs_context * ctx = fc - > fs_private ;
char * qname ;
if ( param - > size < 1 ) {
ext4_msg ( NULL , KERN_ERR , " Missing quota name " ) ;
return - EINVAL ;
}
if ( strchr ( param - > string , ' / ' ) ) {
ext4_msg ( NULL , KERN_ERR ,
" quotafile must be on filesystem root " ) ;
return - EINVAL ;
}
if ( ctx - > s_qf_names [ qtype ] ) {
if ( strcmp ( ctx - > s_qf_names [ qtype ] , param - > string ) ! = 0 ) {
ext4_msg ( NULL , KERN_ERR ,
" %s quota file already specified " ,
QTYPE2NAME ( qtype ) ) ;
return - EINVAL ;
}
return 0 ;
}
qname = kmemdup_nul ( param - > string , param - > size , GFP_KERNEL ) ;
if ( ! qname ) {
ext4_msg ( NULL , KERN_ERR ,
" Not enough memory for storing quotafile name " ) ;
return - ENOMEM ;
}
ctx - > s_qf_names [ qtype ] = qname ;
ctx - > qname_spec | = 1 < < qtype ;
2021-10-27 17:18:52 +03:00
ctx - > spec | = EXT4_SPEC_JQUOTA ;
2021-10-27 17:18:50 +03:00
return 0 ;
}
/*
* Clear the name of the specified quota file .
*/
static int unnote_qf_name ( struct fs_context * fc , int qtype )
{
struct ext4_fs_context * ctx = fc - > fs_private ;
if ( ctx - > s_qf_names [ qtype ] )
kfree ( ctx - > s_qf_names [ qtype ] ) ;
ctx - > s_qf_names [ qtype ] = NULL ;
ctx - > qname_spec | = 1 < < qtype ;
2021-10-27 17:18:52 +03:00
ctx - > spec | = EXT4_SPEC_JQUOTA ;
2021-10-27 17:18:50 +03:00
return 0 ;
}
# endif
2022-05-26 07:04:12 +03:00
static int ext4_parse_test_dummy_encryption ( const struct fs_parameter * param ,
struct ext4_fs_context * ctx )
{
int err ;
if ( ! IS_ENABLED ( CONFIG_FS_ENCRYPTION ) ) {
ext4_msg ( NULL , KERN_WARNING ,
" test_dummy_encryption option not supported " ) ;
return - EINVAL ;
}
err = fscrypt_parse_test_dummy_encryption ( param ,
& ctx - > dummy_enc_policy ) ;
if ( err = = - EINVAL ) {
ext4_msg ( NULL , KERN_WARNING ,
" Value of option \" %s \" is unrecognized " , param - > key ) ;
} else if ( err = = - EEXIST ) {
ext4_msg ( NULL , KERN_WARNING ,
" Conflicting test_dummy_encryption options " ) ;
return - EINVAL ;
}
return err ;
}
2021-10-27 17:18:52 +03:00
# define EXT4_SET_CTX(name) \
2021-12-20 18:26:57 +03:00
static inline void ctx_set_ # # name ( struct ext4_fs_context * ctx , \
unsigned long flag ) \
2021-10-27 17:18:52 +03:00
{ \
ctx - > mask_s_ # # name | = flag ; \
ctx - > vals_s_ # # name | = flag ; \
2022-02-01 16:13:45 +03:00
}
# define EXT4_CLEAR_CTX(name) \
2021-12-20 18:26:57 +03:00
static inline void ctx_clear_ # # name ( struct ext4_fs_context * ctx , \
unsigned long flag ) \
2021-10-27 17:18:52 +03:00
{ \
ctx - > mask_s_ # # name | = flag ; \
ctx - > vals_s_ # # name & = ~ flag ; \
2022-02-01 16:13:45 +03:00
}
# define EXT4_TEST_CTX(name) \
2021-12-20 18:26:57 +03:00
static inline unsigned long \
ctx_test_ # # name ( struct ext4_fs_context * ctx , unsigned long flag ) \
2021-10-27 17:18:52 +03:00
{ \
2021-12-20 18:26:57 +03:00
return ( ctx - > vals_s_ # # name & flag ) ; \
2022-02-01 16:13:45 +03:00
}
2021-10-27 17:18:52 +03:00
2022-02-01 16:13:45 +03:00
EXT4_SET_CTX ( flags ) ; /* set only */
2021-10-27 17:18:52 +03:00
EXT4_SET_CTX ( mount_opt ) ;
2022-02-01 16:13:45 +03:00
EXT4_CLEAR_CTX ( mount_opt ) ;
EXT4_TEST_CTX ( mount_opt ) ;
2021-10-27 17:18:52 +03:00
EXT4_SET_CTX ( mount_opt2 ) ;
2022-02-01 16:13:45 +03:00
EXT4_CLEAR_CTX ( mount_opt2 ) ;
EXT4_TEST_CTX ( mount_opt2 ) ;
static inline void ctx_set_mount_flag ( struct ext4_fs_context * ctx , int bit )
{
set_bit ( bit , & ctx - > mask_s_mount_flags ) ;
set_bit ( bit , & ctx - > vals_s_mount_flags ) ;
}
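The `ctx_set_*` and `ctx_clear_*` helpers above record two things per option word: which bits were touched (`mask_s_*`) and what value they should take (`vals_s_*`). A minimal sketch of how such a pair could later be folded into a live flags word, assuming a hypothetical helper `apply_masked` that is not part of this file:

```c
#include <assert.h>

/*
 * Hypothetical helper: bits selected by mask take their value from vals,
 * all other bits of cur are left unchanged.
 */
static unsigned int apply_masked(unsigned int cur, unsigned int mask,
				 unsigned int vals)
{
	return (cur & ~mask) | (vals & mask);
}
```

Keeping the mask separate lets an option that was never mentioned leave the existing setting alone, while an explicit "no..." option can still clear a bit that is on by default.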
2021-10-27 17:18:52 +03:00
2021-10-27 17:18:54 +03:00
static int ext4_parse_param ( struct fs_context * fc , struct fs_parameter * param )
2012-03-04 08:20:47 +04:00
{
2021-10-27 17:18:48 +03:00
struct ext4_fs_context * ctx = fc - > fs_private ;
struct fs_parse_result result ;
2012-03-04 08:20:47 +04:00
const struct mount_opts * m ;
2021-10-27 17:18:48 +03:00
int is_remount ;
2012-02-08 03:41:49 +04:00
kuid_t uid ;
kgid_t gid ;
2021-10-27 17:18:48 +03:00
int token ;
token = fs_parse ( fc , ext4_param_specs , param , & result ) ;
if ( token < 0 )
return token ;
is_remount = fc - > purpose = = FS_CONTEXT_FOR_RECONFIGURE ;
2012-03-04 08:20:47 +04:00
2021-10-27 17:18:57 +03:00
for ( m = ext4_mount_opts ; m - > token ! = Opt_err ; m + + )
if ( token = = m - > token )
break ;
ctx - > opt_flags | = m - > flags ;
if ( m - > flags & MOPT_EXPLICIT ) {
if ( m - > mount_opt & EXT4_MOUNT_DELALLOC ) {
ctx_set_mount_opt2 ( ctx , EXT4_MOUNT2_EXPLICIT_DELALLOC ) ;
} else if ( m - > mount_opt & EXT4_MOUNT_JOURNAL_CHECKSUM ) {
ctx_set_mount_opt2 ( ctx ,
EXT4_MOUNT2_EXPLICIT_JOURNAL_CHECKSUM ) ;
} else
return - EINVAL ;
}
if ( m - > flags & MOPT_NOSUPPORT ) {
ext4_msg ( NULL , KERN_ERR , " %s option not supported " ,
param - > key ) ;
return 0 ;
}
switch ( token ) {
2012-04-17 02:55:26 +04:00
# ifdef CONFIG_QUOTA
2021-10-27 17:18:57 +03:00
case Opt_usrjquota :
2021-10-27 17:18:48 +03:00
if ( ! * param - > string )
2021-10-27 17:18:50 +03:00
return unnote_qf_name ( fc , USRQUOTA ) ;
2021-10-27 17:18:48 +03:00
else
2021-10-27 17:18:50 +03:00
return note_qf_name ( fc , USRQUOTA , param ) ;
2021-10-27 17:18:57 +03:00
case Opt_grpjquota :
2021-10-27 17:18:48 +03:00
if ( ! * param - > string )
2021-10-27 17:18:50 +03:00
return unnote_qf_name ( fc , GRPQUOTA ) ;
2021-10-27 17:18:48 +03:00
else
2021-10-27 17:18:50 +03:00
return note_qf_name ( fc , GRPQUOTA , param ) ;
2012-04-17 02:55:26 +04:00
# endif
2012-03-05 07:06:20 +04:00
case Opt_noacl :
case Opt_nouser_xattr :
2021-10-27 17:18:49 +03:00
ext4_msg ( NULL , KERN_WARNING , deprecated_msg , param - > key , " 3.5 " ) ;
2012-03-05 07:06:20 +04:00
break ;
2012-03-04 08:20:47 +04:00
case Opt_sb :
2021-10-27 17:18:53 +03:00
if ( fc - > purpose = = FS_CONTEXT_FOR_RECONFIGURE ) {
ext4_msg ( NULL , KERN_WARNING ,
" Ignoring %s option on remount " , param - > key ) ;
} else {
ctx - > s_sb_block = result . uint_32 ;
ctx - > spec | = EXT4_SPEC_s_sb_block ;
}
2021-10-27 17:18:54 +03:00
return 0 ;
2012-03-04 08:20:47 +04:00
case Opt_removed :
2021-10-27 17:18:49 +03:00
ext4_msg ( NULL , KERN_WARNING , " Ignoring removed %s option " ,
2021-10-27 17:18:48 +03:00
param - > key ) ;
2021-10-27 17:18:54 +03:00
return 0 ;
2012-03-04 08:20:47 +04:00
case Opt_abort :
2022-02-01 16:13:45 +03:00
ctx_set_mount_flag ( ctx , EXT4_MF_FS_ABORTED ) ;
2021-10-27 17:18:54 +03:00
return 0 ;
2012-03-04 08:20:47 +04:00
case Opt_i_version :
2021-12-22 13:45:17 +03:00
ext4_msg ( NULL , KERN_WARNING , deprecated_msg , param - > key , " 5.20 " ) ;
ext4_msg ( NULL , KERN_WARNING , " Use iversion instead \n " ) ;
2021-10-27 17:18:52 +03:00
ctx_set_flags ( ctx , SB_I_VERSION ) ;
2021-10-27 17:18:54 +03:00
return 0 ;
2020-07-02 04:56:07 +03:00
case Opt_inlinecrypt :
# ifdef CONFIG_FS_ENCRYPTION_INLINE_CRYPT
2021-10-27 17:18:52 +03:00
ctx_set_flags ( ctx , SB_INLINECRYPT ) ;
2020-07-02 04:56:07 +03:00
# else
2021-10-27 17:18:49 +03:00
ext4_msg ( NULL , KERN_ERR , " inline encryption not supported " ) ;
2020-07-02 04:56:07 +03:00
# endif
2021-10-27 17:18:54 +03:00
return 0 ;
2021-10-27 17:18:48 +03:00
case Opt_errors :
2021-10-27 17:18:52 +03:00
ctx_clear_mount_opt ( ctx , EXT4_MOUNT_ERRORS_MASK ) ;
2021-10-27 17:18:57 +03:00
ctx_set_mount_opt ( ctx , result . uint_32 ) ;
return 0 ;
# ifdef CONFIG_QUOTA
case Opt_jqfmt :
ctx - > s_jquota_fmt = result . uint_32 ;
ctx - > spec | = EXT4_SPEC_JQFMT ;
return 0 ;
# endif
case Opt_data :
ctx_clear_mount_opt ( ctx , EXT4_MOUNT_DATA_FLAGS ) ;
ctx_set_mount_opt ( ctx , result . uint_32 ) ;
ctx - > spec | = EXT4_SPEC_DATAJ ;
return 0 ;
case Opt_commit :
2021-10-27 17:18:48 +03:00
if ( result . uint_32 = = 0 )
2021-10-27 17:18:52 +03:00
ctx - > s_commit_interval = JBD2_DEFAULT_MAX_COMMIT_AGE ;
2021-10-27 17:18:48 +03:00
else if ( result . uint_32 > INT_MAX / HZ ) {
2021-10-27 17:18:49 +03:00
ext4_msg ( NULL , KERN_ERR ,
2019-08-28 18:25:01 +03:00
" Invalid commit interval %d, "
" must be smaller than %d " ,
2021-10-27 17:18:48 +03:00
result . uint_32 , INT_MAX / HZ ) ;
2021-10-27 17:18:49 +03:00
return - EINVAL ;
2019-08-28 18:25:01 +03:00
}
2021-10-27 17:18:52 +03:00
ctx - > s_commit_interval = HZ * result . uint_32 ;
ctx - > spec | = EXT4_SPEC_s_commit_interval ;
2021-10-27 17:18:57 +03:00
return 0 ;
case Opt_debug_want_extra_isize :
2021-10-27 17:18:52 +03:00
if ( ( result . uint_32 & 1 ) | | ( result . uint_32 < 4 ) ) {
2021-10-27 17:18:49 +03:00
ext4_msg ( NULL , KERN_ERR ,
2021-10-27 17:18:48 +03:00
" Invalid want_extra_isize %d " , result . uint_32 ) ;
2021-10-27 17:18:49 +03:00
return - EINVAL ;
2019-12-15 09:09:03 +03:00
}
2021-10-27 17:18:52 +03:00
ctx - > s_want_extra_isize = result . uint_32 ;
ctx - > spec | = EXT4_SPEC_s_want_extra_isize ;
2021-10-27 17:18:57 +03:00
return 0 ;
case Opt_max_batch_time :
2021-10-27 17:18:52 +03:00
ctx - > s_max_batch_time = result . uint_32 ;
ctx - > spec | = EXT4_SPEC_s_max_batch_time ;
2021-10-27 17:18:57 +03:00
return 0 ;
case Opt_min_batch_time :
2021-10-27 17:18:52 +03:00
ctx - > s_min_batch_time = result . uint_32 ;
ctx - > spec | = EXT4_SPEC_s_min_batch_time ;
2021-10-27 17:18:57 +03:00
return 0 ;
case Opt_inode_readahead_blks :
2021-10-27 17:18:48 +03:00
if ( result . uint_32 & &
( result . uint_32 > ( 1 < < 30 ) | |
! is_power_of_2 ( result . uint_32 ) ) ) {
2021-10-27 17:18:49 +03:00
ext4_msg ( NULL , KERN_ERR ,
2013-02-03 08:14:31 +04:00
" EXT4-fs: inode_readahead_blks must be "
" 0 or a power of 2 smaller than 2^31 " ) ;
2021-10-27 17:18:49 +03:00
return - EINVAL ;
2013-02-03 08:09:36 +04:00
}
2021-10-27 17:18:52 +03:00
ctx - > s_inode_readahead_blks = result . uint_32 ;
ctx - > spec | = EXT4_SPEC_s_inode_readahead_blks ;
2021-10-27 17:18:57 +03:00
return 0 ;
case Opt_init_itable :
2021-10-27 17:18:52 +03:00
ctx_set_mount_opt ( ctx , EXT4_MOUNT_INIT_INODE_TABLE ) ;
ctx - > s_li_wait_mult = EXT4_DEF_LI_WAIT_MULT ;
2021-10-27 17:18:48 +03:00
if ( param - > type = = fs_value_is_string )
2021-10-27 17:18:52 +03:00
ctx - > s_li_wait_mult = result . uint_32 ;
ctx - > spec | = EXT4_SPEC_s_li_wait_mult ;
2021-10-27 17:18:57 +03:00
return 0 ;
case Opt_max_dir_size_kb :
2021-10-27 17:18:52 +03:00
ctx - > s_max_dir_size_kb = result . uint_32 ;
ctx - > spec | = EXT4_SPEC_s_max_dir_size_kb ;
2021-10-27 17:18:57 +03:00
return 0 ;
2020-10-15 23:37:59 +03:00
# ifdef CONFIG_EXT4_DEBUG
2021-10-27 17:18:57 +03:00
case Opt_fc_debug_max_replay :
2021-10-27 17:18:52 +03:00
ctx - > s_fc_debug_max_replay = result . uint_32 ;
ctx - > spec | = EXT4_SPEC_s_fc_debug_max_replay ;
2021-10-27 17:18:57 +03:00
return 0 ;
2020-10-15 23:37:59 +03:00
# endif
2021-10-27 17:18:57 +03:00
case Opt_stripe :
2021-10-27 17:18:52 +03:00
ctx - > s_stripe = result . uint_32 ;
ctx - > spec | = EXT4_SPEC_s_stripe ;
2021-10-27 17:18:57 +03:00
return 0 ;
case Opt_resuid :
2021-10-27 17:18:48 +03:00
uid = make_kuid ( current_user_ns ( ) , result . uint_32 ) ;
2013-02-03 08:09:36 +04:00
if ( ! uid_valid ( uid ) ) {
2021-10-27 17:18:49 +03:00
ext4_msg ( NULL , KERN_ERR , " Invalid uid value %d " ,
2021-10-27 17:18:48 +03:00
result . uint_32 ) ;
2021-10-27 17:18:49 +03:00
return - EINVAL ;
2012-03-04 08:20:47 +04:00
}
2021-10-27 17:18:52 +03:00
ctx - > s_resuid = uid ;
ctx - > spec | = EXT4_SPEC_s_resuid ;
2021-10-27 17:18:57 +03:00
return 0 ;
case Opt_resgid :
2021-10-27 17:18:48 +03:00
gid = make_kgid ( current_user_ns ( ) , result . uint_32 ) ;
2013-02-03 08:09:36 +04:00
if ( ! gid_valid ( gid ) ) {
2021-10-27 17:18:49 +03:00
ext4_msg ( NULL , KERN_ERR , " Invalid gid value %d " ,
2021-10-27 17:18:48 +03:00
result . uint_32 ) ;
2021-10-27 17:18:49 +03:00
return - EINVAL ;
2013-02-03 08:09:36 +04:00
}
2021-10-27 17:18:52 +03:00
ctx - > s_resgid = gid ;
ctx - > spec | = EXT4_SPEC_s_resgid ;
2021-10-27 17:18:57 +03:00
return 0 ;
case Opt_journal_dev :
2013-02-03 08:09:36 +04:00
if ( is_remount ) {
2021-10-27 17:18:49 +03:00
ext4_msg ( NULL , KERN_ERR ,
2013-02-03 08:09:36 +04:00
" Cannot specify journal on remount " ) ;
2021-10-27 17:18:49 +03:00
return - EINVAL ;
2013-02-03 08:09:36 +04:00
}
2021-10-27 17:18:48 +03:00
ctx - > journal_devnum = result . uint_32 ;
2021-10-27 17:18:52 +03:00
ctx - > spec | = EXT4_SPEC_JOURNAL_DEV ;
2021-10-27 17:18:57 +03:00
return 0 ;
case Opt_journal_path :
{
ext4: allow specifying external journal by pathname mount option
It's always been a hassle that if an external journal's
device number changes, the filesystem won't mount.
And since boot-time enumeration can change, device number
changes aren't unusual.
The current mechanism to update the journal location is by
passing in a mount option w/ a new devnum, but that's a hassle;
it's a manual approach, fixing things after the fact.
Adding a mount option, "-o journal_path=/dev/$DEVICE" would
help, since then we can do e.g.
# mount -o journal_path=/dev/disk/by-label/$JOURNAL_LABEL ...
and it'll mount even if the devnum has changed, as shown here:
# losetup /dev/loop0 journalfile
# mke2fs -L mylabel-journal -O journal_dev /dev/loop0
# mkfs.ext4 -L mylabel -J device=/dev/loop0 /dev/sdb1
Change the journal device number:
# losetup -d /dev/loop0
# losetup /dev/loop1 journalfile
And today it will fail:
# mount /dev/sdb1 /mnt/test
mount: wrong fs type, bad option, bad superblock on /dev/sdb1,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so
# dmesg | tail -n 1
[17343.240702] EXT4-fs (sdb1): error: couldn't read superblock of external journal
But with this new mount option, we can specify the new path:
# mount -o journal_path=/dev/loop1 /dev/sdb1 /mnt/test
#
(which does update the encoded device number, incidentally):
# umount /dev/sdb1
# dumpe2fs -h /dev/sdb1 | grep "Journal device"
dumpe2fs 1.41.12 (17-May-2010)
Journal device: 0x0701
But best of all we can just always mount by journal-path, and
it'll always work:
# mount -o journal_path=/dev/disk/by-label/mylabel-journal /dev/sdb1 /mnt/test
#
So the journal_path option can be specified in fstab, and as long as
the disk is available somewhere, and findable by label (or by UUID),
we can mount.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2013-08-29 03:05:07 +04:00
struct inode * journal_inode ;
struct path path ;
int error ;
if ( is_remount ) {
2021-10-27 17:18:49 +03:00
ext4_msg ( NULL , KERN_ERR ,
2013-08-29 03:05:07 +04:00
" Cannot specify journal on remount " ) ;
2021-10-27 17:18:49 +03:00
return - EINVAL ;
2013-08-29 03:05:07 +04:00
}
2021-10-27 17:18:48 +03:00
error = fs_lookup_param ( fc , param , 1 , & path ) ;
2013-08-29 03:05:07 +04:00
if ( error ) {
2021-10-27 17:18:49 +03:00
ext4_msg ( NULL , KERN_ERR , " error: could not find "
2021-10-27 17:18:48 +03:00
" journal device path " ) ;
2021-10-27 17:18:49 +03:00
return - EINVAL ;
2013-08-29 03:05:07 +04:00
}
2015-03-18 01:25:59 +03:00
journal_inode = d_inode ( path . dentry ) ;
2021-10-27 17:18:48 +03:00
ctx - > journal_devnum = new_encode_dev ( journal_inode - > i_rdev ) ;
2021-10-27 17:18:52 +03:00
ctx - > spec | = EXT4_SPEC_JOURNAL_DEV ;
2013-08-29 03:05:07 +04:00
path_put ( & path ) ;
2021-10-27 17:18:57 +03:00
return 0 ;
}
case Opt_journal_ioprio :
2021-10-27 17:18:48 +03:00
if ( result . uint_32 > 7 ) {
2021-10-27 17:18:49 +03:00
ext4_msg ( NULL , KERN_ERR , " Invalid journal IO priority "
2013-02-03 08:09:36 +04:00
" (must be 0-7) " ) ;
2021-10-27 17:18:49 +03:00
return - EINVAL ;
2013-02-03 08:09:36 +04:00
}
2021-10-27 17:18:48 +03:00
ctx - > journal_ioprio =
IOPRIO_PRIO_VALUE ( IOPRIO_CLASS_BE , result . uint_32 ) ;
2021-10-27 17:18:52 +03:00
ctx - > spec | = EXT4_SPEC_JOURNAL_IOPRIO ;
2021-10-27 17:18:57 +03:00
return 0 ;
case Opt_test_dummy_encryption :
2022-05-26 07:04:12 +03:00
return ext4_parse_test_dummy_encryption ( param , ctx ) ;
2021-10-27 17:18:57 +03:00
case Opt_dax :
case Opt_dax_type :
2015-09-29 22:48:11 +03:00
# ifdef CONFIG_FS_DAX
2021-10-27 17:18:57 +03:00
{
int type = ( token = = Opt_dax ) ?
Opt_dax : result . uint_32 ;
switch ( type ) {
2020-05-28 18:00:00 +03:00
case Opt_dax :
case Opt_dax_always :
2021-10-27 17:18:57 +03:00
ctx_set_mount_opt ( ctx , EXT4_MOUNT_DAX_ALWAYS ) ;
2021-10-27 17:18:52 +03:00
ctx_clear_mount_opt2 ( ctx , EXT4_MOUNT2_DAX_NEVER ) ;
2020-05-28 18:00:00 +03:00
break ;
case Opt_dax_never :
2021-10-27 17:18:57 +03:00
ctx_set_mount_opt2 ( ctx , EXT4_MOUNT2_DAX_NEVER ) ;
2021-10-27 17:18:52 +03:00
ctx_clear_mount_opt ( ctx , EXT4_MOUNT_DAX_ALWAYS ) ;
2020-05-28 18:00:00 +03:00
break ;
case Opt_dax_inode :
2021-10-27 17:18:52 +03:00
ctx_clear_mount_opt ( ctx , EXT4_MOUNT_DAX_ALWAYS ) ;
ctx_clear_mount_opt2 ( ctx , EXT4_MOUNT2_DAX_NEVER ) ;
2020-05-28 18:00:00 +03:00
/* Strictly for printing options */
2021-10-27 17:18:57 +03:00
ctx_set_mount_opt2 ( ctx , EXT4_MOUNT2_DAX_INODE ) ;
2020-05-28 18:00:00 +03:00
break ;
}
2021-10-27 17:18:57 +03:00
return 0 ;
}
2015-09-29 22:48:11 +03:00
# else
2021-10-27 17:18:49 +03:00
ext4_msg ( NULL , KERN_INFO , " dax option not supported " ) ;
return - EINVAL ;
2013-02-03 08:09:36 +04:00
# endif
2021-10-27 17:18:57 +03:00
	case Opt_data_err:
		if (result.uint_32 == Opt_data_err_abort)
			ctx_set_mount_opt(ctx, m->mount_opt);
		else if (result.uint_32 == Opt_data_err_ignore)
			ctx_clear_mount_opt(ctx, m->mount_opt);
		return 0;
	case Opt_mb_optimize_scan:
2022-03-08 12:52:00 +03:00
if ( result . int_32 = = 1 ) {
ctx_set_mount_opt2 ( ctx , EXT4_MOUNT2_MB_OPTIMIZE_SCAN ) ;
ctx - > spec | = EXT4_SPEC_mb_optimize_scan ;
} else if ( result . int_32 = = 0 ) {
ctx_clear_mount_opt2 ( ctx , EXT4_MOUNT2_MB_OPTIMIZE_SCAN ) ;
ctx - > spec | = EXT4_SPEC_mb_optimize_scan ;
} else {
2021-10-27 17:18:49 +03:00
ext4_msg ( NULL , KERN_WARNING ,
ext4: improve cr 0 / cr 1 group scanning
Instead of traversing through groups linearly, scan groups in specific
orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
largest free order >= the order of the request. So, with this patch,
we maintain lists for each possible order and insert each group into a
list based on the largest free order in its buddy bitmap. During cr 0
allocation, we traverse these lists in the increasing order of largest
free orders. This allows us to find a group with the best available cr
0 match in constant time. If nothing can be found, we fallback to cr 1
immediately.
At CR1, the story is slightly different. We want to traverse in the
order of increasing average fragment size. For CR1, we maintain a rb
tree of groupinfos which is sorted by average fragment size. Instead
of traversing linearly, at CR1, we traverse in the order of increasing
average fragment size, starting at the most optimal group. This brings
down cr 1 search complexity to log(num groups).
For cr >= 2, we just perform the linear search as before. Also, in
case of lock contention, we intermittently fallback to linear search
even in CR 0 and CR 1 cases. This allows us to proceed during the
allocation path even in case of high contention.
There is an opportunity to do optimization at CR2 too. That's because
at CR2 we only consider groups where bb_free counter (number of free
blocks) is greater than the request extent size. That's left as future
work.
All the changes introduced in this patch are protected under a new
mount option "mb_optimize_scan".
With this patchset, following experiment was performed:
Created a highly fragmented disk of size 65TB. The disk had no
contiguous 2M regions. Following command was run consecutively for 3
times:
time dd if=/dev/urandom of=file bs=2M count=10
Here are the results with and without cr 0/1 optimizations introduced
in this patch:
|---------+------------------------------+---------------------------|
| | Without CR 0/1 Optimizations | With CR 0/1 Optimizations |
|---------+------------------------------+---------------------------|
| 1st run | 5m1.871s | 2m47.642s |
| 2nd run | 2m28.390s | 0m0.611s |
| 3rd run | 2m26.530s | 0m1.255s |
|---------+------------------------------+---------------------------|
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20210401172129.189766-6-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-01 20:21:27 +03:00
" mb_optimize_scan should be set to 0 or 1. " ) ;
2021-10-27 17:18:49 +03:00
return - EINVAL ;
2021-04-01 20:21:27 +03:00
}
2021-10-27 17:18:57 +03:00
return 0 ;
}
	/*
	 * At this point we should only be getting options requiring MOPT_SET,
	 * or MOPT_CLEAR. Anything else is a bug
	 */
	if (m->token == Opt_err) {
		ext4_msg(NULL, KERN_WARNING, "buggy handling of option %s",
			 param->key);
		WARN_ON(1);
		return -EINVAL;
	}

	else {
2021-10-27 17:18:48 +03:00
unsigned int set = 0 ;
if ( ( param - > type = = fs_value_is_flag ) | |
result . uint_32 > 0 )
set = 1 ;
2013-02-03 08:09:36 +04:00
if ( m - > flags & MOPT_CLEAR )
2021-10-27 17:18:48 +03:00
set = ! set ;
2013-02-03 08:09:36 +04:00
else if ( unlikely ( ! ( m - > flags & MOPT_SET ) ) ) {
2021-10-27 17:18:49 +03:00
ext4_msg ( NULL , KERN_WARNING ,
2021-10-27 17:18:48 +03:00
" buggy handling of option %s " ,
param - > key ) ;
2013-02-03 08:09:36 +04:00
WARN_ON ( 1 ) ;
2021-10-27 17:18:49 +03:00
return - EINVAL ;
2013-02-03 08:09:36 +04:00
}
2020-10-15 23:37:54 +03:00
if ( m - > flags & MOPT_2 ) {
2021-10-27 17:18:48 +03:00
if ( set ! = 0 )
2021-10-27 17:18:52 +03:00
ctx_set_mount_opt2 ( ctx , m - > mount_opt ) ;
2020-10-15 23:37:54 +03:00
else
2021-10-27 17:18:52 +03:00
ctx_clear_mount_opt2 ( ctx , m - > mount_opt ) ;
2020-10-15 23:37:54 +03:00
} else {
2021-10-27 17:18:48 +03:00
if ( set ! = 0 )
2021-10-27 17:18:52 +03:00
ctx_set_mount_opt ( ctx , m - > mount_opt ) ;
2020-10-15 23:37:54 +03:00
else
2021-10-27 17:18:52 +03:00
ctx_clear_mount_opt ( ctx , m - > mount_opt ) ;
2020-10-15 23:37:54 +03:00
}
2012-03-04 08:20:47 +04:00
}
2021-10-27 17:18:57 +03:00
2021-10-27 17:18:54 +03:00
return 0 ;
2012-03-04 08:20:47 +04:00
}
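The Opt_journal_path case above stores new_encode_dev(journal_inode->i_rdev) in ctx->journal_devnum, and the commit message shows the result as "Journal device: 0x0701" for /dev/loop1. The following is a minimal userspace sketch of that 32-bit encoding, assuming the usual layout (low 8 bits of the minor, then 12 bits of major, then the remaining minor bits). The names encode_dev_sketch, decode_major_sketch and decode_minor_sketch are hypothetical and are not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of the 32-bit device number encoding.
 * Assumed layout: bits 0-7 hold the low byte of the minor, bits 8-19
 * hold the major, bits 20-31 hold the remaining minor bits.
 */
static uint32_t encode_dev_sketch(unsigned int major, unsigned int minor)
{
	return (minor & 0xff) | (major << 8) | ((minor & ~0xffu) << 12);
}

/* Recover the major number from an encoded value. */
static unsigned int decode_major_sketch(uint32_t dev)
{
	return (dev & 0xfff00) >> 8;
}

/* Recover the minor number from an encoded value. */
static unsigned int decode_minor_sketch(uint32_t dev)
{
	return (dev & 0xff) | ((dev >> 12) & 0xfff00);
}
```

With loop devices on major 7, encode_dev_sketch(7, 1) gives 0x0701, which matches the dumpe2fs output quoted in the commit message.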
2021-10-27 17:18:53 +03:00
static int parse_options ( struct fs_context * fc , char * options )
2012-03-04 08:20:47 +04:00
{
2021-10-27 17:18:48 +03:00
struct fs_parameter param ;
int ret ;
char * key ;
2012-03-04 08:20:47 +04:00
if ( ! options )
2021-10-27 17:18:53 +03:00
return 0 ;
2021-10-27 17:18:48 +03:00
	while ((key = strsep(&options, ",")) != NULL) {
		if (*key) {
			size_t v_len = 0;
			char *value = strchr(key, '=');

			param.type = fs_value_is_flag;
			param.string = NULL;

			if (value) {
				if (value == key)
					continue;

				*value++ = 0;
				v_len = strlen(value);
				param.string = kmemdup_nul(value, v_len,
							   GFP_KERNEL);
				if (!param.string)
2021-10-27 17:18:53 +03:00
return - ENOMEM ;
2021-10-27 17:18:48 +03:00
param . type = fs_value_is_string ;
}
param . key = key ;
param . size = v_len ;
2021-10-27 17:18:54 +03:00
ret = ext4_parse_param ( fc , & param ) ;
2021-10-27 17:18:48 +03:00
if ( param . string )
kfree ( param . string ) ;
if ( ret < 0 )
2021-10-27 17:18:53 +03:00
return ret ;
2021-10-27 17:18:48 +03:00
}
2006-10-11 12:20:50 +04:00
}
2021-10-27 17:18:48 +03:00
2021-10-27 17:18:53 +03:00
ret = ext4_validate_options ( fc ) ;
2021-10-27 17:18:49 +03:00
if ( ret < 0 )
2021-10-27 17:18:53 +03:00
return ret ;
2021-10-27 17:18:49 +03:00
2021-10-27 17:18:53 +03:00
return 0 ;
}
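The tokenising loop in parse_options() can be tried outside the kernel. The sketch below is a userspace approximation, not the kernel code: it splits a comma-separated option string in place with strsep(), skips empty tokens and tokens that start with '=', and records each key with an optional value. struct opt_sketch and split_options_sketch are hypothetical names introduced for illustration.

```c
#define _DEFAULT_SOURCE		/* for strsep() on glibc */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One parsed option; value is NULL for flag-style options. */
struct opt_sketch {
	const char *key;
	const char *value;
};

/*
 * Split "a,b=1,c" in place. Returns the number of entries written to
 * out[], at most max. The input buffer is modified, as with strsep().
 */
static size_t split_options_sketch(char *options, struct opt_sketch *out,
				   size_t max)
{
	size_t n = 0;
	char *key;

	while (n < max && (key = strsep(&options, ",")) != NULL) {
		char *value;

		if (!*key)
			continue;	/* empty token, e.g. from ",," */
		value = strchr(key, '=');
		if (value == key)
			continue;	/* "=foo" has no key; skip it */
		if (value)
			*value++ = '\0';	/* terminate key, point at value */
		out[n].key = key;
		out[n].value = value;
		n++;
	}
	return n;
}
```

Unlike the kernel loop, this sketch does not copy the value; the pointers refer into the caller's buffer, so the buffer must outlive the results.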
static int parse_apply_sb_mount_options(struct super_block *sb,
					struct ext4_fs_context *m_ctx)
{
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	char *s_mount_opts = NULL;
	struct ext4_fs_context *s_ctx = NULL;
	struct fs_context *fc = NULL;
	int ret = -ENOMEM;

	if (!sbi->s_es->s_mount_opts[0])
2021-10-27 17:18:50 +03:00
return 0 ;
2021-10-27 17:18:53 +03:00
	s_mount_opts = kstrndup(sbi->s_es->s_mount_opts,
				sizeof(sbi->s_es->s_mount_opts),
				GFP_KERNEL);
	if (!s_mount_opts)
		return ret;

	fc = kzalloc(sizeof(struct fs_context), GFP_KERNEL);
	if (!fc)
		goto out_free;

	s_ctx = kzalloc(sizeof(struct ext4_fs_context), GFP_KERNEL);
	if (!s_ctx)
		goto out_free;

	fc->fs_private = s_ctx;
	fc->s_fs_info = sbi;

	ret = parse_options(fc, s_mount_opts);
2021-10-27 17:18:52 +03:00
if ( ret < 0 )
2021-10-27 17:18:53 +03:00
goto parse_failed ;
2021-10-27 17:18:50 +03:00
2021-10-27 17:18:53 +03:00
	ret = ext4_check_opt_consistency(fc, sb);
	if (ret < 0) {
parse_failed:
		ext4_msg(sb, KERN_WARNING,
			 "failed to parse options in superblock: %s",
			 s_mount_opts);
		ret = 0;
		goto out_free;
	}

	if (s_ctx->spec & EXT4_SPEC_JOURNAL_DEV)
		m_ctx->journal_devnum = s_ctx->journal_devnum;
	if (s_ctx->spec & EXT4_SPEC_JOURNAL_IOPRIO)
		m_ctx->journal_ioprio = s_ctx->journal_ioprio;
2022-05-26 07:04:12 +03:00
ext4_apply_options ( fc , sb ) ;
ret = 0 ;
2021-10-27 17:18:53 +03:00
out_free :
2022-05-14 02:16:01 +03:00
if ( fc ) {
ext4_fc_free ( fc ) ;
kfree ( fc ) ;
}
2021-10-27 17:18:53 +03:00
kfree ( s_mount_opts ) ;
return ret ;
2021-10-27 17:18:47 +03:00
}
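parse_apply_sb_mount_options() copies journal_devnum and journal_ioprio from the superblock context into the mount context only when the matching EXT4_SPEC_* bit is set, so a value that was never specified does not overwrite one that was. A minimal sketch of that "specified" bitmask pattern follows. The struct, the SPEC_* constants and merge_specified_sketch are hypothetical stand-ins, not the ext4 definitions.

```c
#include <assert.h>

/* Hypothetical "was this field explicitly given?" bits. */
#define SPEC_JOURNAL_DEV	(1u << 0)
#define SPEC_JOURNAL_IOPRIO	(1u << 1)

/* Hypothetical, reduced stand-in for a parsing context. */
struct ctx_sketch {
	unsigned int spec;		/* bitmask of fields that were set */
	unsigned int journal_devnum;
	unsigned int journal_ioprio;
};

/* Copy only the fields that src explicitly specified. */
static void merge_specified_sketch(struct ctx_sketch *dst,
				   const struct ctx_sketch *src)
{
	if (src->spec & SPEC_JOURNAL_DEV)
		dst->journal_devnum = src->journal_devnum;
	if (src->spec & SPEC_JOURNAL_IOPRIO)
		dst->journal_ioprio = src->journal_ioprio;
}
```

The benefit of a separate bitmask is that zero remains a legal field value: "not specified" and "specified as 0" stay distinguishable.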
2021-10-27 17:18:50 +03:00
static void ext4_apply_quota_options ( struct fs_context * fc ,
struct super_block * sb )
{
# ifdef CONFIG_QUOTA
2021-10-27 17:18:52 +03:00
bool quota_feature = ext4_has_feature_quota ( sb ) ;
2021-10-27 17:18:50 +03:00
struct ext4_fs_context * ctx = fc - > fs_private ;
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
char * qname ;
int i ;
2021-10-27 17:18:52 +03:00
	if (quota_feature)
		return;

	if (ctx->spec & EXT4_SPEC_JQUOTA) {
		for (i = 0; i < EXT4_MAXQUOTAS; i++) {
			if (!(ctx->qname_spec & (1 << i)))
				continue;

			qname = ctx->s_qf_names[i]; /* May be NULL */
2022-01-04 17:35:18 +03:00
if ( qname )
set_opt ( sb , QUOTA ) ;
2021-10-27 17:18:52 +03:00
ctx - > s_qf_names [ i ] = NULL ;
2022-01-04 17:35:17 +03:00
qname = rcu_replace_pointer ( sbi - > s_qf_names [ i ] , qname ,
lockdep_is_held ( & sb - > s_umount ) ) ;
if ( qname )
kfree_rcu ( qname ) ;
2021-10-27 17:18:52 +03:00
}
2021-10-27 17:18:50 +03:00
}
2021-10-27 17:18:52 +03:00
if ( ctx - > spec & EXT4_SPEC_JQFMT )
sbi - > s_jquota_fmt = ctx - > s_jquota_fmt ;
2021-10-27 17:18:50 +03:00
# endif
}
/*
 * Check quota settings consistency.
 */
static int ext4_check_quota_consistency(struct fs_context *fc,
					struct super_block *sb)
{
#ifdef CONFIG_QUOTA
	struct ext4_fs_context *ctx = fc->fs_private;
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	bool quota_feature = ext4_has_feature_quota(sb);
	bool quota_loaded = sb_any_quota_loaded(sb);
2021-10-27 17:18:52 +03:00
	bool usr_qf_name, grp_qf_name, usrquota, grpquota;
	int quota_flags, i;

	/*
	 * We do the test below only for project quotas. 'usrquota' and
	 * 'grpquota' mount options are allowed even without quota feature
	 * to support legacy quotas in quota files.
	 */
	if (ctx_test_mount_opt(ctx, EXT4_MOUNT_PRJQUOTA) &&
	    !ext4_has_feature_project(sb)) {
		ext4_msg(NULL, KERN_ERR, "Project quota feature not enabled. "
			 "Cannot enable project quota enforcement.");
		return -EINVAL;
	}
2021-10-27 17:18:50 +03:00
2021-10-27 17:18:52 +03:00
	quota_flags = EXT4_MOUNT_QUOTA | EXT4_MOUNT_USRQUOTA |
		      EXT4_MOUNT_GRPQUOTA | EXT4_MOUNT_PRJQUOTA;

	if (quota_loaded &&
	    ctx->mask_s_mount_opt & quota_flags &&
	    !ctx_test_mount_opt(ctx, quota_flags))
		goto err_quota_change;

	if (ctx->spec & EXT4_SPEC_JQUOTA) {
2021-10-27 17:18:50 +03:00
for ( i = 0 ; i < EXT4_MAXQUOTAS ; i + + ) {
if ( ! ( ctx - > qname_spec & ( 1 < < i ) ) )
continue ;
2021-10-27 17:18:52 +03:00
if ( quota_loaded & &
! ! sbi - > s_qf_names [ i ] ! = ! ! ctx - > s_qf_names [ i ] )
2021-10-27 17:18:50 +03:00
goto err_jquota_change ;
if ( sbi - > s_qf_names [ i ] & & ctx - > s_qf_names [ i ] & &
2022-01-04 17:35:17 +03:00
strcmp ( get_qf_name ( sb , sbi , i ) ,
2021-10-27 17:18:50 +03:00
ctx - > s_qf_names [ i ] ) ! = 0 )
goto err_jquota_specified ;
}
2021-10-27 17:18:52 +03:00
if ( quota_feature ) {
ext4_msg ( NULL , KERN_INFO ,
" Journaled quota options ignored when "
" QUOTA feature is enabled " ) ;
return 0 ;
}
2021-10-27 17:18:50 +03:00
}
2021-10-27 17:18:52 +03:00
if ( ctx - > spec & EXT4_SPEC_JQFMT ) {
2021-10-27 17:18:50 +03:00
if ( sbi - > s_jquota_fmt ! = ctx - > s_jquota_fmt & & quota_loaded )
2021-10-27 17:18:52 +03:00
goto err_jquota_change ;
2021-10-27 17:18:50 +03:00
if ( quota_feature ) {
ext4_msg ( NULL , KERN_INFO , " Quota format mount options "
" ignored when QUOTA feature is enabled " ) ;
return 0 ;
}
}
2021-10-27 17:18:52 +03:00
	/* Make sure we don't mix old and new quota format */
	usr_qf_name = (get_qf_name(sb, sbi, USRQUOTA) ||
		       ctx->s_qf_names[USRQUOTA]);
	grp_qf_name = (get_qf_name(sb, sbi, GRPQUOTA) ||
		       ctx->s_qf_names[GRPQUOTA]);

	usrquota = (ctx_test_mount_opt(ctx, EXT4_MOUNT_USRQUOTA) ||
		    test_opt(sb, USRQUOTA));

	grpquota = (ctx_test_mount_opt(ctx, EXT4_MOUNT_GRPQUOTA) ||
		    test_opt(sb, GRPQUOTA));

	if (usr_qf_name) {
		ctx_clear_mount_opt(ctx, EXT4_MOUNT_USRQUOTA);
		usrquota = false;
	}
	if (grp_qf_name) {
		ctx_clear_mount_opt(ctx, EXT4_MOUNT_GRPQUOTA);
		grpquota = false;
	}

	if (usr_qf_name || grp_qf_name) {
		if (usrquota || grpquota) {
			ext4_msg(NULL, KERN_ERR, "old and new quota "
				 "format mixing");
			return -EINVAL;
		}

		if (!(ctx->spec & EXT4_SPEC_JQFMT || sbi->s_jquota_fmt)) {
			ext4_msg(NULL, KERN_ERR, "journaled quota format "
				 "not specified");
			return -EINVAL;
		}
	}
2021-10-27 17:18:50 +03:00
	return 0;

err_quota_change:
	ext4_msg(NULL, KERN_ERR,
		 "Cannot change quota options when quota turned on");
	return -EINVAL;
err_jquota_change:
	ext4_msg(NULL, KERN_ERR, "Cannot change journaled quota "
		 "options when quota turned on");
	return -EINVAL;
err_jquota_specified:
	ext4_msg(NULL, KERN_ERR, "%s quota file already specified",
		 QTYPE2NAME(i));
	return -EINVAL;
#else
	return 0;
#endif
}
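The final part of ext4_check_quota_consistency() reduces to a small decision over five booleans. The sketch below restates that rule in isolation, as a reading of the code above and not as a replacement for it: a named quota file overrides the plain usrquota/grpquota option of the same type, a remaining plain option alongside any named file counts as mixing, and named files require a journaled quota format. The enum and check_quota_mix_sketch are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result codes for the sketch. */
enum quota_check_sketch {
	QUOTA_OK,
	QUOTA_ERR_MIXED,	/* "old and new quota format mixing" */
	QUOTA_ERR_NO_FORMAT	/* "journaled quota format not specified" */
};

static enum quota_check_sketch
check_quota_mix_sketch(bool usr_qf_name, bool grp_qf_name,
		       bool usrquota, bool grpquota, bool have_jqfmt)
{
	/* A named quota file overrides the plain option of the same type. */
	if (usr_qf_name)
		usrquota = false;
	if (grp_qf_name)
		grpquota = false;

	if (usr_qf_name || grp_qf_name) {
		if (usrquota || grpquota)
			return QUOTA_ERR_MIXED;
		if (!have_jqfmt)
			return QUOTA_ERR_NO_FORMAT;
	}
	return QUOTA_OK;
}
```

For example, a named user quota file combined with a plain grpquota option is reported as mixing, while the same named file combined with a plain usrquota option is accepted because the plain option is dropped first.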
ext4: only allow test_dummy_encryption when supported
Make the test_dummy_encryption mount option require that the encrypt
feature flag be already enabled on the filesystem, rather than
automatically enabling it. Practically, this means that "-O encrypt"
will need to be included in MKFS_OPTIONS when running xfstests with the
test_dummy_encryption mount option. (ext4/053 also needs an update.)
Moreover, as long as the preconditions for test_dummy_encryption are
being tightened anyway, take the opportunity to start rejecting it when
!CONFIG_FS_ENCRYPTION rather than ignoring it.
The motivation for requiring the encrypt feature flag is that:
- Having the filesystem auto-enable feature flags is problematic, as it
bypasses the usual sanity checks. The specific issue which came up
recently is that in kernel versions where ext4 supports casefold but
not encrypt+casefold (v5.1 through v5.10), the kernel will happily add
the encrypt flag to a filesystem that has the casefold flag, making it
unmountable -- but only for subsequent mounts, not the initial one.
This confused the casefold support detection in xfstests, causing
generic/556 to fail rather than be skipped.
- The xfstests-bld test runners (kvm-xfstests et al.) already use the
required mkfs flag, so they will not be affected by this change. Only
users of test_dummy_encryption alone will be affected. But, this
option has always been for testing only, so it should be fine to
require that the few users of this option update their test scripts.
- f2fs already requires it (for its equivalent feature flag).
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Link: https://lore.kernel.org/r/20220519204437.61645-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-05-19 23:44:37 +03:00
static int ext4_check_test_dummy_encryption(const struct fs_context *fc,
					    struct super_block *sb)
{
	const struct ext4_fs_context *ctx = fc->fs_private;
	const struct ext4_sb_info *sbi = EXT4_SB(sb);
2022-05-26 07:04:12 +03:00
int err ;
2022-05-19 23:44:37 +03:00
2022-05-26 07:04:12 +03:00
if ( ! fscrypt_is_dummy_policy_set ( & ctx - > dummy_enc_policy ) )
2022-05-19 23:44:37 +03:00
return 0 ;
	if (!ext4_has_feature_encrypt(sb)) {
		ext4_msg(NULL, KERN_WARNING,
			 "test_dummy_encryption requires encrypt feature");
		return -EINVAL;
	}
	/*
	 * This mount option is just for testing, and it's not worthwhile to
	 * implement the extra complexity (e.g. RCU protection) that would be
	 * needed to allow it to be set or changed during remount.  We do allow
	 * it to be specified during remount, but only if there is no change.
	 */
2022-05-26 07:04:12 +03:00
if ( fc - > purpose = = FS_CONTEXT_FOR_RECONFIGURE ) {
if ( fscrypt_dummy_policies_equal ( & sbi - > s_dummy_enc_policy ,
& ctx - > dummy_enc_policy ) )
return 0 ;
ext4: only allow test_dummy_encryption when supported
Make the test_dummy_encryption mount option require that the encrypt
feature flag be already enabled on the filesystem, rather than
automatically enabling it. Practically, this means that "-O encrypt"
will need to be included in MKFS_OPTIONS when running xfstests with the
test_dummy_encryption mount option. (ext4/053 also needs an update.)
Moreover, as long as the preconditions for test_dummy_encryption are
being tightened anyway, take the opportunity to start rejecting it when
!CONFIG_FS_ENCRYPTION rather than ignoring it.
The motivation for requiring the encrypt feature flag is that:
- Having the filesystem auto-enable feature flags is problematic, as it
bypasses the usual sanity checks. The specific issue which came up
recently is that in kernel versions where ext4 supports casefold but
not encrypt+casefold (v5.1 through v5.10), the kernel will happily add
the encrypt flag to a filesystem that has the casefold flag, making it
unmountable -- but only for subsequent mounts, not the initial one.
This confused the casefold support detection in xfstests, causing
generic/556 to fail rather than be skipped.
- The xfstests-bld test runners (kvm-xfstests et al.) already use the
required mkfs flag, so they will not be affected by this change. Only
users of test_dummy_encryption alone will be affected. But, this
option has always been for testing only, so it should be fine to
require that the few users of this option update their test scripts.
- f2fs already requires it (for its equivalent feature flag).
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Link: https://lore.kernel.org/r/20220519204437.61645-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
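The preconditions described in this commit message can be sketched as a small standalone decision function. This is only an illustrative model under stated assumptions: `struct dummy_mount_state` and `check_test_dummy_encryption()` are hypothetical names, not kernel API, and the use of -EINVAL for the !CONFIG_FS_ENCRYPTION case is an assumption (the message only says the option is rejected).

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for kernel state (not a real ext4 type). */
struct dummy_mount_state {
	bool fs_encryption_built_in;	/* models CONFIG_FS_ENCRYPTION */
	bool has_encrypt_feature;	/* models ext4_has_feature_encrypt(sb) */
};

/*
 * Return 0 if the test_dummy_encryption option may be honoured,
 * or -EINVAL if the mount should be rejected.
 */
static int check_test_dummy_encryption(const struct dummy_mount_state *st,
				       bool option_requested)
{
	/* Nothing to validate when the option was not given. */
	if (!option_requested)
		return 0;

	/* Reject (rather than silently ignore) when encryption is compiled out. */
	if (!st->fs_encryption_built_in)
		return -EINVAL;

	/* Never auto-enable the feature flag; require mkfs to have set it. */
	if (!st->has_encrypt_feature)
		return -EINVAL;

	return 0;
}
```

Under this model, a filesystem created without "-O encrypt" fails the check instead of having the flag added behind the administrator's back, which is the behaviour change the message argues for.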
2022-05-19 23:44:37 +03:00
ext4_msg ( NULL , KERN_WARNING ,
2022-05-26 07:04:12 +03:00
" Can't set or change test_dummy_encryption on remount " ) ;
2022-05-19 23:44:37 +03:00
return - EINVAL ;
}
2022-05-26 07:04:12 +03:00
/* Also make sure s_mount_opts didn't contain a conflicting value. */
if ( fscrypt_is_dummy_policy_set ( & sbi - > s_dummy_enc_policy ) ) {
if ( fscrypt_dummy_policies_equal ( & sbi - > s_dummy_enc_policy ,
& ctx - > dummy_enc_policy ) )
return 0 ;
ext4_msg ( NULL , KERN_WARNING ,
" Conflicting test_dummy_encryption options " ) ;
return - EINVAL ;
}
/*
* fscrypt_add_test_dummy_key ( ) technically changes the super_block , so
* technically it should be delayed until ext4_apply_options ( ) like the
* other changes . But since we never get here for remounts ( see above ) ,
* and this is the last chance to report errors , we do it here .
*/
err = fscrypt_add_test_dummy_key ( sb , & ctx - > dummy_enc_policy ) ;
if ( err )
ext4_msg ( NULL , KERN_WARNING ,
" Error adding test dummy encryption key [%d] " , err ) ;
return err ;
}
static void ext4_apply_test_dummy_encryption ( struct ext4_fs_context * ctx ,
struct super_block * sb )
{
if ( ! fscrypt_is_dummy_policy_set ( & ctx - > dummy_enc_policy ) | |
/* if already set, it was already verified to be the same */
fscrypt_is_dummy_policy_set ( & EXT4_SB ( sb ) - > s_dummy_enc_policy ) )
return ;
EXT4_SB ( sb ) - > s_dummy_enc_policy = ctx - > dummy_enc_policy ;
memset ( & ctx - > dummy_enc_policy , 0 , sizeof ( ctx - > dummy_enc_policy ) ) ;
ext4_msg ( sb , KERN_WARNING , " Test dummy encryption mode enabled " ) ;
2022-05-19 23:44:37 +03:00
}
2021-10-27 17:18:51 +03:00
static int ext4_check_opt_consistency ( struct fs_context * fc ,
struct super_block * sb )
{
struct ext4_fs_context * ctx = fc - > fs_private ;
2021-10-27 17:18:52 +03:00
struct ext4_sb_info * sbi = fc - > s_fs_info ;
int is_remount = fc - > purpose = = FS_CONTEXT_FOR_RECONFIGURE ;
2022-05-19 23:44:37 +03:00
int err ;
2021-10-27 17:18:51 +03:00
if ( ( ctx - > opt_flags & MOPT_NO_EXT2 ) & & IS_EXT2_SB ( sb ) ) {
ext4_msg ( NULL , KERN_ERR ,
" Mount option(s) incompatible with ext2 " ) ;
return - EINVAL ;
}
if ( ( ctx - > opt_flags & MOPT_NO_EXT3 ) & & IS_EXT3_SB ( sb ) ) {
ext4_msg ( NULL , KERN_ERR ,
" Mount option(s) incompatible with ext3 " ) ;
return - EINVAL ;
}
2021-10-27 17:18:52 +03:00
if ( ctx - > s_want_extra_isize >
( sbi - > s_inode_size - EXT4_GOOD_OLD_INODE_SIZE ) ) {
ext4_msg ( NULL , KERN_ERR ,
" Invalid want_extra_isize %d " ,
ctx - > s_want_extra_isize ) ;
return - EINVAL ;
}
if ( ctx_test_mount_opt ( ctx , EXT4_MOUNT_DIOREAD_NOLOCK ) ) {
int blocksize =
BLOCK_SIZE < < le32_to_cpu ( sbi - > s_es - > s_log_block_size ) ;
if ( blocksize < PAGE_SIZE )
ext4_msg ( NULL , KERN_WARNING , " Warning: mounting with an "
" experimental mount option 'dioread_nolock' "
" for blocksize < PAGE_SIZE " ) ;
}
2022-05-19 23:44:37 +03:00
err = ext4_check_test_dummy_encryption ( fc , sb ) ;
if ( err )
return err ;
2021-10-27 17:18:52 +03:00
if ( ( ctx - > spec & EXT4_SPEC_DATAJ ) & & is_remount ) {
if ( ! sbi - > s_journal ) {
ext4_msg ( NULL , KERN_WARNING ,
" Remounting file system with no journal "
" so ignoring journalled data option " ) ;
ctx_clear_mount_opt ( ctx , EXT4_MOUNT_DATA_FLAGS ) ;
2021-12-20 18:26:57 +03:00
} else if ( ctx_test_mount_opt ( ctx , EXT4_MOUNT_DATA_FLAGS ) ! =
test_opt ( sb , DATA_FLAGS ) ) {
2021-10-27 17:18:52 +03:00
ext4_msg ( NULL , KERN_ERR , " Cannot change data mode "
" on remount " ) ;
return - EINVAL ;
}
}
if ( is_remount ) {
if ( ctx_test_mount_opt ( ctx , EXT4_MOUNT_DAX_ALWAYS ) & &
( test_opt ( sb , DATA_FLAGS ) = = EXT4_MOUNT_JOURNAL_DATA ) ) {
ext4_msg ( NULL , KERN_ERR , " can't mount with "
" both data=journal and dax " ) ;
return - EINVAL ;
}
if ( ctx_test_mount_opt ( ctx , EXT4_MOUNT_DAX_ALWAYS ) & &
( ! ( sbi - > s_mount_opt & EXT4_MOUNT_DAX_ALWAYS ) | |
( sbi - > s_mount_opt2 & EXT4_MOUNT2_DAX_NEVER ) ) ) {
fail_dax_change_remount :
ext4_msg ( NULL , KERN_ERR , " can't change "
" dax mount option while remounting " ) ;
return - EINVAL ;
} else if ( ctx_test_mount_opt2 ( ctx , EXT4_MOUNT2_DAX_NEVER ) & &
( ! ( sbi - > s_mount_opt2 & EXT4_MOUNT2_DAX_NEVER ) | |
( sbi - > s_mount_opt & EXT4_MOUNT_DAX_ALWAYS ) ) ) {
goto fail_dax_change_remount ;
} else if ( ctx_test_mount_opt2 ( ctx , EXT4_MOUNT2_DAX_INODE ) & &
( ( sbi - > s_mount_opt & EXT4_MOUNT_DAX_ALWAYS ) | |
( sbi - > s_mount_opt2 & EXT4_MOUNT2_DAX_NEVER ) | |
! ( sbi - > s_mount_opt2 & EXT4_MOUNT2_DAX_INODE ) ) ) {
goto fail_dax_change_remount ;
}
}
2021-10-27 17:18:51 +03:00
return ext4_check_quota_consistency ( fc , sb ) ;
}
2022-05-26 07:04:12 +03:00
static void ext4_apply_options ( struct fs_context * fc , struct super_block * sb )
2021-10-27 17:18:47 +03:00
{
2021-10-27 17:18:52 +03:00
struct ext4_fs_context * ctx = fc - > fs_private ;
2021-10-27 17:18:49 +03:00
struct ext4_sb_info * sbi = fc - > s_fs_info ;
2021-10-27 17:18:52 +03:00
sbi - > s_mount_opt & = ~ ctx - > mask_s_mount_opt ;
sbi - > s_mount_opt | = ctx - > vals_s_mount_opt ;
sbi - > s_mount_opt2 & = ~ ctx - > mask_s_mount_opt2 ;
sbi - > s_mount_opt2 | = ctx - > vals_s_mount_opt2 ;
sbi - > s_mount_flags & = ~ ctx - > mask_s_mount_flags ;
sbi - > s_mount_flags | = ctx - > vals_s_mount_flags ;
sb - > s_flags & = ~ ctx - > mask_s_flags ;
sb - > s_flags | = ctx - > vals_s_flags ;
2021-12-22 13:45:17 +03:00
/*
* i_version differs from common mount option iversion so we have
* to let vfs know that it was set , otherwise it would get cleared
* on remount
*/
if ( ctx - > mask_s_flags & SB_I_VERSION )
fc - > sb_flags | = SB_I_VERSION ;
2021-10-27 17:18:52 +03:00
# define APPLY(X) ({ if (ctx->spec & EXT4_SPEC_##X) sbi->X = ctx->X; })
APPLY ( s_commit_interval ) ;
APPLY ( s_stripe ) ;
APPLY ( s_max_batch_time ) ;
APPLY ( s_min_batch_time ) ;
APPLY ( s_want_extra_isize ) ;
APPLY ( s_inode_readahead_blks ) ;
APPLY ( s_max_dir_size_kb ) ;
APPLY ( s_li_wait_mult ) ;
APPLY ( s_resgid ) ;
APPLY ( s_resuid ) ;
# ifdef CONFIG_EXT4_DEBUG
APPLY ( s_fc_debug_max_replay ) ;
# endif
ext4_apply_quota_options ( fc , sb ) ;
2022-05-26 07:04:12 +03:00
ext4_apply_test_dummy_encryption ( ctx , sb ) ;
2021-10-27 17:18:52 +03:00
}
static int ext4_validate_options ( struct fs_context * fc )
{
2006-10-11 12:20:50 +04:00
# ifdef CONFIG_QUOTA
2021-10-27 17:18:52 +03:00
struct ext4_fs_context * ctx = fc - > fs_private ;
2021-10-27 17:18:47 +03:00
char * usr_qf_name , * grp_qf_name ;
2021-10-27 17:18:52 +03:00
usr_qf_name = ctx - > s_qf_names [ USRQUOTA ] ;
grp_qf_name = ctx - > s_qf_names [ GRPQUOTA ] ;
2018-10-12 16:28:09 +03:00
if ( usr_qf_name | | grp_qf_name ) {
2021-10-27 17:18:52 +03:00
if ( ctx_test_mount_opt ( ctx , EXT4_MOUNT_USRQUOTA ) & & usr_qf_name )
ctx_clear_mount_opt ( ctx , EXT4_MOUNT_USRQUOTA ) ;
2006-10-11 12:20:50 +04:00
2021-10-27 17:18:52 +03:00
if ( ctx_test_mount_opt ( ctx , EXT4_MOUNT_GRPQUOTA ) & & grp_qf_name )
ctx_clear_mount_opt ( ctx , EXT4_MOUNT_GRPQUOTA ) ;
2006-10-11 12:20:50 +04:00
2021-10-27 17:18:52 +03:00
if ( ctx_test_mount_opt ( ctx , EXT4_MOUNT_USRQUOTA ) | |
ctx_test_mount_opt ( ctx , EXT4_MOUNT_GRPQUOTA ) ) {
2021-10-27 17:18:49 +03:00
ext4_msg ( NULL , KERN_ERR , " old and new quota "
2021-10-27 17:18:52 +03:00
" format mixing " ) ;
2021-10-27 17:18:49 +03:00
return - EINVAL ;
2006-10-11 12:20:50 +04:00
}
}
# endif
2021-10-27 17:18:52 +03:00
return 1 ;
2006-10-11 12:20:50 +04:00
}
2012-03-04 08:20:50 +04:00
static inline void ext4_show_quota_options ( struct seq_file * seq ,
struct super_block * sb )
{
# if defined(CONFIG_QUOTA)
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
2018-10-12 16:28:09 +03:00
char * usr_qf_name , * grp_qf_name ;
2012-03-04 08:20:50 +04:00
if ( sbi - > s_jquota_fmt ) {
char * fmtname = " " ;
switch ( sbi - > s_jquota_fmt ) {
case QFMT_VFS_OLD :
fmtname = " vfsold " ;
break ;
case QFMT_VFS_V0 :
fmtname = " vfsv0 " ;
break ;
case QFMT_VFS_V1 :
fmtname = " vfsv1 " ;
break ;
}
seq_printf ( seq , " ,jqfmt=%s " , fmtname ) ;
}
2018-10-12 16:28:09 +03:00
rcu_read_lock ( ) ;
usr_qf_name = rcu_dereference ( sbi - > s_qf_names [ USRQUOTA ] ) ;
grp_qf_name = rcu_dereference ( sbi - > s_qf_names [ GRPQUOTA ] ) ;
if ( usr_qf_name )
seq_show_option ( seq , " usrjquota " , usr_qf_name ) ;
if ( grp_qf_name )
seq_show_option ( seq , " grpjquota " , grp_qf_name ) ;
rcu_read_unlock ( ) ;
2012-03-04 08:20:50 +04:00
# endif
}
2012-03-05 04:27:31 +04:00
static const char * token2str ( int token )
{
2021-10-27 17:18:55 +03:00
const struct fs_parameter_spec * spec ;
2012-03-05 04:27:31 +04:00
2021-10-27 17:18:55 +03:00
for ( spec = ext4_param_specs ; spec - > name ! = NULL ; spec + + )
if ( spec - > opt = = token & & ! spec - > type )
2012-03-05 04:27:31 +04:00
break ;
2021-10-27 17:18:55 +03:00
return spec - > name ;
2012-03-05 04:27:31 +04:00
}
2012-03-04 08:20:50 +04:00
/*
* Show an option if
* - it ' s set to a non - default value OR
* - if the per - sb default is different from the global default
*/
2012-03-05 05:21:38 +04:00
static int _ext4_show_options ( struct seq_file * seq , struct super_block * sb ,
int nodefs )
2012-03-04 08:20:50 +04:00
{
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
struct ext4_super_block * es = sbi - > s_es ;
2018-03-30 07:51:10 +03:00
int def_errors , def_mount_opt = sbi - > s_def_mount_opt ;
2012-03-05 04:27:31 +04:00
const struct mount_opts * m ;
2012-03-05 05:21:38 +04:00
char sep = nodefs ? ' \n ' : ' , ' ;
2012-03-04 08:20:50 +04:00
2012-03-05 05:21:38 +04:00
# define SEQ_OPTS_PUTS(str) seq_printf(seq, "%c" str, sep)
# define SEQ_OPTS_PRINT(str, arg) seq_printf(seq, "%c" str, sep, arg)
2012-03-04 08:20:50 +04:00
if ( sbi - > s_sb_block ! = 1 )
2012-03-05 04:27:31 +04:00
SEQ_OPTS_PRINT ( " sb=%llu " , sbi - > s_sb_block ) ;
for ( m = ext4_mount_opts ; m - > token ! = Opt_err ; m + + ) {
int want_set = m - > flags & MOPT_SET ;
if ( ( ( m - > flags & ( MOPT_SET | MOPT_CLEAR ) ) = = 0 ) | |
2021-10-27 17:18:57 +03:00
m - > flags & MOPT_SKIP )
2012-03-05 04:27:31 +04:00
continue ;
2018-03-30 07:51:10 +03:00
if ( ! nodefs & & ! ( m - > mount_opt & ( sbi - > s_mount_opt ^ def_mount_opt ) ) )
2012-03-05 04:27:31 +04:00
continue ; /* skip if same as the default */
if ( ( want_set & &
( sbi - > s_mount_opt & m - > mount_opt ) ! = m - > mount_opt ) | |
( ! want_set & & ( sbi - > s_mount_opt & m - > mount_opt ) ) )
continue ; /* select Opt_noFoo vs Opt_Foo */
SEQ_OPTS_PRINT ( " %s " , token2str ( m - > token ) ) ;
2012-03-04 08:20:50 +04:00
}
2012-03-05 04:27:31 +04:00
2012-02-08 03:41:49 +04:00
if ( nodefs | | ! uid_eq ( sbi - > s_resuid , make_kuid ( & init_user_ns , EXT4_DEF_RESUID ) ) | |
2012-03-05 04:27:31 +04:00
le16_to_cpu ( es - > s_def_resuid ) ! = EXT4_DEF_RESUID )
2012-02-08 03:41:49 +04:00
SEQ_OPTS_PRINT ( " resuid=%u " ,
from_kuid_munged ( & init_user_ns , sbi - > s_resuid ) ) ;
if ( nodefs | | ! gid_eq ( sbi - > s_resgid , make_kgid ( & init_user_ns , EXT4_DEF_RESGID ) ) | |
2012-03-05 04:27:31 +04:00
le16_to_cpu ( es - > s_def_resgid ) ! = EXT4_DEF_RESGID )
2012-02-08 03:41:49 +04:00
SEQ_OPTS_PRINT ( " resgid=%u " ,
from_kgid_munged ( & init_user_ns , sbi - > s_resgid ) ) ;
2012-03-05 05:21:38 +04:00
def_errors = nodefs ? - 1 : le16_to_cpu ( es - > s_errors ) ;
2012-03-05 04:27:31 +04:00
if ( test_opt ( sb , ERRORS_RO ) & & def_errors ! = EXT4_ERRORS_RO )
SEQ_OPTS_PUTS ( " errors=remount-ro " ) ;
2012-03-04 08:20:50 +04:00
if ( test_opt ( sb , ERRORS_CONT ) & & def_errors ! = EXT4_ERRORS_CONTINUE )
2012-03-05 04:27:31 +04:00
SEQ_OPTS_PUTS ( " errors=continue " ) ;
2012-03-04 08:20:50 +04:00
if ( test_opt ( sb , ERRORS_PANIC ) & & def_errors ! = EXT4_ERRORS_PANIC )
2012-03-05 04:27:31 +04:00
SEQ_OPTS_PUTS ( " errors=panic " ) ;
2012-03-05 05:21:38 +04:00
if ( nodefs | | sbi - > s_commit_interval ! = JBD2_DEFAULT_MAX_COMMIT_AGE * HZ )
2012-03-05 04:27:31 +04:00
SEQ_OPTS_PRINT ( " commit=%lu " , sbi - > s_commit_interval / HZ ) ;
2012-03-05 05:21:38 +04:00
if ( nodefs | | sbi - > s_min_batch_time ! = EXT4_DEF_MIN_BATCH_TIME )
2012-03-05 04:27:31 +04:00
SEQ_OPTS_PRINT ( " min_batch_time=%u " , sbi - > s_min_batch_time ) ;
2012-03-05 05:21:38 +04:00
if ( nodefs | | sbi - > s_max_batch_time ! = EXT4_DEF_MAX_BATCH_TIME )
2012-03-05 04:27:31 +04:00
SEQ_OPTS_PRINT ( " max_batch_time=%u " , sbi - > s_max_batch_time ) ;
2017-10-18 23:56:26 +03:00
if ( sb - > s_flags & SB_I_VERSION )
2012-03-05 04:27:31 +04:00
SEQ_OPTS_PUTS ( " i_version " ) ;
2012-03-05 05:21:38 +04:00
if ( nodefs | | sbi - > s_stripe )
2012-03-05 04:27:31 +04:00
SEQ_OPTS_PRINT ( " stripe=%lu " , sbi - > s_stripe ) ;
2018-03-30 07:51:10 +03:00
if ( nodefs | | EXT4_MOUNT_DATA_FLAGS &
( sbi - > s_mount_opt ^ def_mount_opt ) ) {
2012-03-05 04:27:31 +04:00
if ( test_opt ( sb , DATA_FLAGS ) = = EXT4_MOUNT_JOURNAL_DATA )
SEQ_OPTS_PUTS ( " data=journal " ) ;
else if ( test_opt ( sb , DATA_FLAGS ) = = EXT4_MOUNT_ORDERED_DATA )
SEQ_OPTS_PUTS ( " data=ordered " ) ;
else if ( test_opt ( sb , DATA_FLAGS ) = = EXT4_MOUNT_WRITEBACK_DATA )
SEQ_OPTS_PUTS ( " data=writeback " ) ;
}
2012-03-05 05:21:38 +04:00
if ( nodefs | |
sbi - > s_inode_readahead_blks ! = EXT4_DEF_INODE_READAHEAD_BLKS )
2012-03-05 04:27:31 +04:00
SEQ_OPTS_PRINT ( " inode_readahead_blks=%u " ,
sbi - > s_inode_readahead_blks ) ;
2012-03-04 08:20:50 +04:00
2018-03-30 07:53:33 +03:00
if ( test_opt ( sb , INIT_INODE_TABLE ) & & ( nodefs | |
2012-03-05 05:21:38 +04:00
( sbi - > s_li_wait_mult ! = EXT4_DEF_LI_WAIT_MULT ) ) )
2012-03-05 04:27:31 +04:00
SEQ_OPTS_PRINT ( " init_itable=%u " , sbi - > s_li_wait_mult ) ;
2012-08-17 17:48:17 +04:00
if ( nodefs | | sbi - > s_max_dir_size_kb )
SEQ_OPTS_PRINT ( " max_dir_size_kb=%u " , sbi - > s_max_dir_size_kb ) ;
2016-03-13 05:55:50 +03:00
if ( test_opt ( sb , DATA_ERR_ABORT ) )
SEQ_OPTS_PUTS ( " data_err=abort " ) ;
fscrypt: support test_dummy_encryption=v2
v1 encryption policies are deprecated in favor of v2, and some new
features (e.g. encryption+casefolding) are only being added for v2.
Therefore, the "test_dummy_encryption" mount option (which is used for
encryption I/O testing with xfstests) needs to support v2 policies.
To do this, extend its syntax to be "test_dummy_encryption=v1" or
"test_dummy_encryption=v2". The existing "test_dummy_encryption" (no
argument) also continues to be accepted, to specify the default setting
-- currently v1, but the next patch changes it to v2.
To cleanly support both v1 and v2 while also making it easy to support
specifying other encryption settings in the future (say, accepting
"$contents_mode:$filenames_mode:v2"), make ext4 and f2fs maintain a
pointer to the dummy fscrypt_context rather than using mount flags.
To avoid concurrency issues, don't allow test_dummy_encryption to be set
or changed during a remount. (The former restriction is new, but
xfstests doesn't run into it, so no one should notice.)
Tested with 'gce-xfstests -c {ext4,f2fs}/encrypt -g auto'. On ext4,
there are two regressions, both of which are test bugs: ext4/023 and
ext4/028 fail because they set an xattr and expect it to be stored
inline, but the increase in size of the fscrypt_context from
24 to 40 bytes causes this xattr to be spilled into an external block.
Link: https://lore.kernel.org/r/20200512233251.118314-4-ebiggers@kernel.org
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
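The extended option syntax described above (bare "test_dummy_encryption", "=v1", or "=v2") can be illustrated with a minimal parser. This is a sketch, not the fscrypt implementation: `enum dummy_policy_version` and `parse_dummy_policy_arg()` are hypothetical names, and the default version is passed in by the caller because the message notes that the default changes from v1 to v2 in a later patch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result type for this sketch. */
enum dummy_policy_version {
	DUMMY_POLICY_INVALID = -1,
	DUMMY_POLICY_V1 = 1,
	DUMMY_POLICY_V2 = 2,
};

/*
 * Map the optional argument of test_dummy_encryption to a policy version.
 * arg is NULL when the option was given without "=...".
 */
static enum dummy_policy_version
parse_dummy_policy_arg(const char *arg,
		       enum dummy_policy_version default_version)
{
	if (arg == NULL)
		return default_version;	/* bare "test_dummy_encryption" */
	if (strcmp(arg, "v1") == 0)
		return DUMMY_POLICY_V1;
	if (strcmp(arg, "v2") == 0)
		return DUMMY_POLICY_V2;
	return DUMMY_POLICY_INVALID;	/* unknown argument: caller rejects it */
}
```

Keeping the default as a parameter means a change of default setting only touches the caller, while unknown arguments are always reported as invalid.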
2020-05-13 02:32:50 +03:00
fscrypt_show_test_dummy_encryption ( seq , sep , sb ) ;
2012-03-04 08:20:50 +04:00
2020-07-02 04:56:07 +03:00
if ( sb - > s_flags & SB_INLINECRYPT )
SEQ_OPTS_PUTS ( " inlinecrypt " ) ;
2020-05-28 18:00:00 +03:00
if ( test_opt ( sb , DAX_ALWAYS ) ) {
if ( IS_EXT2_SB ( sb ) )
SEQ_OPTS_PUTS ( " dax " ) ;
else
SEQ_OPTS_PUTS ( " dax=always " ) ;
} else if ( test_opt2 ( sb , DAX_NEVER ) ) {
SEQ_OPTS_PUTS ( " dax=never " ) ;
} else if ( test_opt2 ( sb , DAX_INODE ) ) {
SEQ_OPTS_PUTS ( " dax=inode " ) ;
}
2022-07-04 08:46:03 +03:00
if ( sbi - > s_groups_count > = MB_DEFAULT_LINEAR_SCAN_THRESHOLD & &
! test_opt2 ( sb , MB_OPTIMIZE_SCAN ) ) {
SEQ_OPTS_PUTS ( " mb_optimize_scan=0 " ) ;
} else if ( sbi - > s_groups_count < MB_DEFAULT_LINEAR_SCAN_THRESHOLD & &
test_opt2 ( sb , MB_OPTIMIZE_SCAN ) ) {
SEQ_OPTS_PUTS ( " mb_optimize_scan=1 " ) ;
}
2012-03-04 08:20:50 +04:00
ext4_show_quota_options ( seq , sb ) ;
return 0 ;
}
2012-03-05 05:21:38 +04:00
static int ext4_show_options ( struct seq_file * seq , struct dentry * root )
{
return _ext4_show_options ( seq , root - > d_sb , 0 ) ;
}
2015-09-23 19:46:17 +03:00
int ext4_seq_options_show ( struct seq_file * seq , void * offset )
2012-03-05 05:21:38 +04:00
{
struct super_block * sb = seq - > private ;
int rc ;
2017-07-17 10:45:34 +03:00
seq_puts ( seq , sb_rdonly ( sb ) ? " ro " : " rw " ) ;
2012-03-05 05:21:38 +04:00
rc = _ext4_show_options ( seq , sb , 1 ) ;
seq_puts ( seq , " \n " ) ;
return rc ;
}
2006-10-11 12:20:53 +04:00
static int ext4_setup_super ( struct super_block * sb , struct ext4_super_block * es ,
2006-10-11 12:20:50 +04:00
int read_only )
{
2006-10-11 12:20:53 +04:00
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
2018-05-14 06:02:19 +03:00
int err = 0 ;
2006-10-11 12:20:50 +04:00
2006-10-11 12:20:53 +04:00
if ( le32_to_cpu ( es - > s_rev_level ) > EXT4_MAX_SUPP_REV ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " revision level too high, "
" forcing read-only mode " ) ;
2018-05-14 06:02:19 +03:00
err = - EROFS ;
2020-06-01 10:34:04 +03:00
goto done ;
2006-10-11 12:20:50 +04:00
}
if ( read_only )
2011-09-10 02:34:51 +04:00
goto done ;
2006-10-11 12:20:53 +04:00
if ( ! ( sbi - > s_mount_state & EXT4_VALID_FS ) )
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_WARNING , " warning: mounting unchecked fs, "
" running e2fsck is recommended " ) ;
2014-05-12 20:55:07 +04:00
else if ( sbi - > s_mount_state & EXT4_ERROR_FS )
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_WARNING ,
" warning: mounting fs with errors, "
" running e2fsck is recommended " ) ;
2011-05-18 21:29:57 +04:00
else if ( ( __s16 ) le16_to_cpu ( es - > s_max_mnt_count ) > 0 & &
2006-10-11 12:20:50 +04:00
le16_to_cpu ( es - > s_mnt_count ) > =
( unsigned short ) ( __s16 ) le16_to_cpu ( es - > s_max_mnt_count ) )
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_WARNING ,
" warning: maximal mount count reached, "
" running e2fsck is recommended " ) ;
2006-10-11 12:20:50 +04:00
else if ( le32_to_cpu ( es - > s_checkinterval ) & &
2018-07-29 22:51:48 +03:00
( ext4_get_tstamp ( es , s_lastcheck ) +
le32_to_cpu ( es - > s_checkinterval ) < = ktime_get_real_seconds ( ) ) )
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_WARNING ,
" warning: checktime reached, "
" running e2fsck is recommended " ) ;
2009-06-04 01:59:28 +04:00
if ( ! sbi - > s_journal )
2009-01-07 08:06:22 +03:00
es - > s_state & = cpu_to_le16 ( ~ EXT4_VALID_FS ) ;
2006-10-11 12:20:50 +04:00
if ( ! ( __s16 ) le16_to_cpu ( es - > s_max_mnt_count ) )
2006-10-11 12:20:53 +04:00
es - > s_max_mnt_count = cpu_to_le16 ( EXT4_DFL_MAX_MNT_COUNT ) ;
2008-04-17 18:38:59 +04:00
le16_add_cpu ( & es - > s_mnt_count , 1 ) ;
2018-07-29 22:51:48 +03:00
ext4_update_tstamp ( es , s_mtime ) ;
2021-08-16 12:57:06 +03:00
if ( sbi - > s_journal ) {
2015-10-17 23:18:43 +03:00
ext4_set_feature_journal_needs_recovery ( sb ) ;
2021-08-16 12:57:06 +03:00
if ( ext4_has_feature_orphan_file ( sb ) )
ext4_set_feature_orphan_present ( sb ) ;
}
2006-10-11 12:20:50 +04:00
2020-12-16 13:18:38 +03:00
err = ext4_commit_super ( sb ) ;
2011-09-10 02:34:51 +04:00
done :
2006-10-11 12:20:50 +04:00
if ( test_opt ( sb , DEBUG ) )
2009-01-06 06:18:16 +03:00
printk ( KERN_INFO " [EXT4 FS bs=%lu, gc=%u, "
2010-12-16 04:30:48 +03:00
" bpg=%lu, ipg=%lu, mo=%04x, mo2=%04x] \n " ,
2006-10-11 12:20:50 +04:00
sb - > s_blocksize ,
sbi - > s_groups_count ,
2006-10-11 12:20:53 +04:00
EXT4_BLOCKS_PER_GROUP ( sb ) ,
EXT4_INODES_PER_GROUP ( sb ) ,
2010-12-16 04:30:48 +03:00
sbi - > s_mount_opt , sbi - > s_mount_opt2 ) ;
2018-05-14 06:02:19 +03:00
return err ;
2006-10-11 12:20:50 +04:00
}
2012-09-05 09:29:50 +04:00
int ext4_alloc_flex_bg_array ( struct super_block * sb , ext4_group_t ngroup )
{
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
2020-02-19 06:08:51 +03:00
struct flex_groups * * old_groups , * * new_groups ;
2020-02-28 12:22:56 +03:00
int size , i , j ;
2012-09-05 09:29:50 +04:00
if ( ! sbi - > s_log_groups_per_flex )
return 0 ;
size = ext4_flex_group ( sbi , ngroup - 1 ) + 1 ;
if ( size < = sbi - > s_flex_groups_allocated )
return 0 ;
2020-02-19 06:08:51 +03:00
new_groups = kvzalloc ( roundup_pow_of_two ( size *
sizeof ( * sbi - > s_flex_groups ) ) , GFP_KERNEL ) ;
2012-09-05 09:29:50 +04:00
if ( ! new_groups ) {
2020-02-19 06:08:51 +03:00
ext4_msg ( sb , KERN_ERR ,
" not enough memory for %d flex group pointers " , size ) ;
2012-09-05 09:29:50 +04:00
return - ENOMEM ;
}
2020-02-19 06:08:51 +03:00
for ( i = sbi - > s_flex_groups_allocated ; i < size ; i + + ) {
new_groups [ i ] = kvzalloc ( roundup_pow_of_two (
sizeof ( struct flex_groups ) ) ,
GFP_KERNEL ) ;
if ( ! new_groups [ i ] ) {
2020-02-28 12:22:56 +03:00
for ( j = sbi - > s_flex_groups_allocated ; j < i ; j + + )
kvfree ( new_groups [ j ] ) ;
2020-02-19 06:08:51 +03:00
kvfree ( new_groups ) ;
ext4_msg ( sb , KERN_ERR ,
" not enough memory for %d flex groups " , size ) ;
return - ENOMEM ;
}
2012-09-05 09:29:50 +04:00
}
2020-02-19 06:08:51 +03:00
rcu_read_lock ( ) ;
old_groups = rcu_dereference ( sbi - > s_flex_groups ) ;
if ( old_groups )
memcpy ( new_groups , old_groups ,
( sbi - > s_flex_groups_allocated *
sizeof ( struct flex_groups * ) ) ) ;
rcu_read_unlock ( ) ;
rcu_assign_pointer ( sbi - > s_flex_groups , new_groups ) ;
sbi - > s_flex_groups_allocated = size ;
if ( old_groups )
ext4_kvfree_array_rcu ( old_groups ) ;
2012-09-05 09:29:50 +04:00
return 0 ;
}
2008-07-12 03:27:31 +04:00
static int ext4_fill_flex_info ( struct super_block * sb )
{
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
struct ext4_group_desc * gdp = NULL ;
2020-02-19 06:08:51 +03:00
struct flex_groups * fg ;
2008-07-12 03:27:31 +04:00
ext4_group_t flex_group ;
2012-09-05 09:29:50 +04:00
int i , err ;
2008-07-12 03:27:31 +04:00
2009-11-23 15:24:46 +03:00
sbi - > s_log_groups_per_flex = sbi - > s_es - > s_log_groups_per_flex ;
ext4: fix undefined behavior in ext4_fill_flex_info()
Commit 503358ae01b70ce6909d19dd01287093f6b6271c ("ext4: avoid divide by
zero when trying to mount a corrupted file system") fixes CVE-2009-4307
by performing a sanity check on s_log_groups_per_flex, since it can be
set to a bogus value by an attacker.
sbi->s_log_groups_per_flex = sbi->s_es->s_log_groups_per_flex;
groups_per_flex = 1 << sbi->s_log_groups_per_flex;
if (groups_per_flex < 2) { ... }
This patch fixes two potential issues in the previous commit.
1) The sanity check might only work on architectures like PowerPC.
On x86, 5 bits are used for the shifting amount. That means, given a
large s_log_groups_per_flex value like 36, groups_per_flex = 1 << 36
is essentially 1 << 4 = 16, rather than 0. This will bypass the check,
leaving s_log_groups_per_flex and groups_per_flex inconsistent.
2) The sanity check relies on undefined behavior, i.e., oversized shift.
A standard-conforming C compiler could rewrite the check in unexpected
ways. Consider the following equivalent form, assuming groups_per_flex
is unsigned for simplicity.
groups_per_flex = 1 << sbi->s_log_groups_per_flex;
if (groups_per_flex == 0 || groups_per_flex == 1) {
We compile the code snippet using Clang 3.0 and GCC 4.6. Clang will
completely optimize away the check groups_per_flex == 0, leaving the
patched code as vulnerable as the original. GCC keeps the check, but
there is no guarantee that future versions will do the same.
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
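The range check described in this commit message can be sketched in isolation as follows. This is a minimal illustration, not the kernel code: the helper name `groups_per_flex_from_log` and its out-parameter are assumptions made for the example. The point it shows is that the log2 value is validated before it is ever used as a shift count, so no oversized shift is evaluated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of ext4): validate an on-disk log2 value
 * before using it as a shift count. Shifting a 32-bit value by 32 or more
 * is undefined behavior in C, so the range check has to come first.
 */
bool groups_per_flex_from_log(uint8_t log_groups, uint32_t *groups_per_flex)
{
	assert(groups_per_flex != NULL);

	/* Reject 0 (fewer than 2 groups per flex group) and oversized shifts. */
	if (log_groups < 1 || log_groups > 31)
		return false;

	*groups_per_flex = UINT32_C(1) << log_groups;
	return true;
}
```

With this shape, a bogus value such as 36 is rejected by the comparison itself, independent of how a particular CPU or compiler treats an out-of-range shift.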
2012-01-10 20:51:10 +04:00
if ( sbi - > s_log_groups_per_flex < 1 | | sbi - > s_log_groups_per_flex > 31 ) {
2008-07-12 03:27:31 +04:00
sbi - > s_log_groups_per_flex = 0 ;
return 1 ;
}
2012-09-05 09:29:50 +04:00
err = ext4_alloc_flex_bg_array ( sb , sbi - > s_groups_count ) ;
if ( err )
2011-08-01 16:45:02 +04:00
goto failed ;
2008-07-12 03:27:31 +04:00
for ( i = 0 ; i < sbi - > s_groups_count ; i + + ) {
2009-05-25 19:50:39 +04:00
gdp = ext4_get_group_desc ( sb , i , NULL ) ;
2008-07-12 03:27:31 +04:00
flex_group = ext4_flex_group ( sbi , i ) ;
2020-02-19 06:08:51 +03:00
fg = sbi_array_rcu_deref ( sbi , s_flex_groups , flex_group ) ;
atomic_add ( ext4_free_inodes_count ( sb , gdp ) , & fg - > free_inodes ) ;
2013-03-12 07:39:59 +04:00
atomic64_add ( ext4_free_group_clusters ( sb , gdp ) ,
2020-02-19 06:08:51 +03:00
& fg - > free_clusters ) ;
atomic_add ( ext4_used_dirs_count ( sb , gdp ) , & fg - > used_dirs ) ;
2008-07-12 03:27:31 +04:00
}
return 1 ;
failed :
return 0 ;
}
2015-10-17 23:18:43 +03:00
static __le16 ext4_group_desc_csum ( struct super_block * sb , __u32 block_group ,
2012-04-30 02:45:10 +04:00
struct ext4_group_desc * gdp )
Ext4: Uninitialized Block Groups
In pass1 of e2fsck, every inode table in the filesystem is scanned and checked,
regardless of whether it is in use. This is the most time-consuming part
of the filesystem check. The uninitialized block group feature can greatly
reduce e2fsck time by eliminating checking of uninitialized inodes.
With this feature, there is a high water mark of used inodes for each block
group. Block and inode bitmaps can be uninitialized on disk via a flag in the
group descriptor to avoid reading or scanning them at e2fsck time. A checksum
of each group descriptor is used to ensure that corruption in the group
descriptor's bit flags does not cause incorrect operation.
The feature is enabled through a mkfs option
mke2fs /dev/ -O uninit_groups
A patch adding support for uninitialized block groups to e2fsprogs tools has
been posted to the linux-ext4 mailing list.
The patches have been stress tested with fsstress and fsx. In performance
tests testing e2fsck time, we have seen that e2fsck time on ext3 grows
linearly with the total number of inodes in the filesystem. In ext4 with the
uninitialized block groups feature, the e2fsck time depends
solely on the number of used inodes rather than the total inode count.
Since typical ext4 filesystems only use 1-10% of their inodes, this feature can
greatly reduce e2fsck time for users, with a performance improvement of 2-20
times, depending on how full the filesystem is.
The attached graph shows the major improvements in e2fsck times in filesystems
with a large total inode count, but few inodes in use.
In each group descriptor if we have
EXT4_BG_INODE_UNINIT set in bg_flags:
Inode table is not initialized/used in this group. So we can skip
the consistency check during fsck.
EXT4_BG_BLOCK_UNINIT set in bg_flags:
No block in the group is used. So we can skip the block bitmap
verification for this group.
We also add two new fields to group descriptor as a part of
uninitialized group patch.
__le16 bg_itable_unused; /* Unused inodes count */
__le16 bg_checksum; /* crc16(sb_uuid+group+desc) */
bg_itable_unused:
If EXT4_BG_INODE_UNINIT is not set in bg_flags,
then bg_itable_unused gives the count of unused inodes at the
end of the group's inode table. This can be
used by fsck to skip the inodes that are marked unused.
bg_checksum:
Now that we depend on bg_flags and bg_itable_unused to determine
the block and inode usage, we need to make sure group descriptor
is not corrupt. We add checksum to group descriptor to
detect corruption. If the descriptor is found to be corrupt, we
mark all the blocks and inodes in the group used.
Signed-off-by: Avantika Mathur <mathur@us.ibm.com>
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
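The crc16(sb_uuid+group+desc) scheme described above can be sketched as a standalone program fragment. Everything named `demo_*` is an assumption for illustration: the struct is a simplified stand-in rather than the ext4 on-disk layout, the CRC routine is a plain bitwise CRC-16 with reflected polynomial 0xA001, and the group number is hashed in host byte order (the kernel code converts it to little-endian first). The sketch only shows the idea of chaining the checksum over the UUID, the group number, and the descriptor bytes that precede the checksum field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a group descriptor; NOT the ext4 on-disk layout. */
struct demo_group_desc {
	uint16_t flags;
	uint16_t itable_unused;
	uint16_t checksum;	/* last field, so nothing follows it */
};

/* Bitwise CRC-16 (reflected polynomial 0xA001), continuing from crc. */
uint16_t demo_crc16(uint16_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc & 1) ? (crc >> 1) ^ 0xA001 : crc >> 1;
	}
	return crc;
}

/* Chain the CRC over uuid, group number, and the bytes before checksum. */
uint16_t demo_desc_csum(const uint8_t uuid[16], uint32_t group,
			const struct demo_group_desc *gdp)
{
	uint16_t crc = demo_crc16(0xFFFF, uuid, 16);

	/* Host byte order here; an on-disk format would fix the endianness. */
	crc = demo_crc16(crc, &group, sizeof(group));
	return demo_crc16(crc, gdp, offsetof(struct demo_group_desc, checksum));
}

/* Returns 1 if the stored checksum matches the recomputed one, else 0. */
int demo_desc_csum_verify(const uint8_t uuid[16], uint32_t group,
			  const struct demo_group_desc *gdp)
{
	return gdp->checksum == demo_desc_csum(uuid, group, gdp);
}
```

Because the group number is part of the input, a descriptor copied into the wrong slot fails verification even if its contents are otherwise intact.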
2007-10-17 02:38:25 +04:00
{
2016-07-04 00:51:39 +03:00
int offset = offsetof ( struct ext4_group_desc , bg_checksum ) ;
2007-10-17 02:38:25 +04:00
__u16 crc = 0 ;
2012-04-30 02:45:10 +04:00
__le32 le_group = cpu_to_le32 ( block_group ) ;
2015-10-17 23:18:43 +03:00
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
2007-10-17 02:38:25 +04:00
2014-10-13 11:36:16 +04:00
if ( ext4_has_metadata_csum ( sbi - > s_sb ) ) {
2012-04-30 02:45:10 +04:00
/* Use new metadata_csum algorithm */
__u32 csum32 ;
2016-07-04 00:51:39 +03:00
__u16 dummy_csum = 0 ;
2012-04-30 02:45:10 +04:00
csum32 = ext4_chksum ( sbi , sbi - > s_csum_seed , ( __u8 * ) & le_group ,
sizeof ( le_group ) ) ;
2016-07-04 00:51:39 +03:00
csum32 = ext4_chksum ( sbi , csum32 , ( __u8 * ) gdp , offset ) ;
csum32 = ext4_chksum ( sbi , csum32 , ( __u8 * ) & dummy_csum ,
sizeof ( dummy_csum ) ) ;
offset + = sizeof ( dummy_csum ) ;
if ( offset < sbi - > s_desc_size )
csum32 = ext4_chksum ( sbi , csum32 , ( __u8 * ) gdp + offset ,
sbi - > s_desc_size - offset ) ;
2012-04-30 02:45:10 +04:00
crc = csum32 & 0xFFFF ;
goto out ;
2007-10-17 02:38:25 +04:00
}
2012-04-30 02:45:10 +04:00
/* old crc16 code */
2015-10-17 23:18:43 +03:00
if ( ! ext4_has_feature_gdt_csum ( sb ) )
2014-10-14 10:35:49 +04:00
return 0 ;
2012-04-30 02:45:10 +04:00
crc = crc16 ( ~ 0 , sbi - > s_es - > s_uuid , sizeof ( sbi - > s_es - > s_uuid ) ) ;
crc = crc16 ( crc , ( __u8 * ) & le_group , sizeof ( le_group ) ) ;
crc = crc16 ( crc , ( __u8 * ) gdp , offset ) ;
offset + = sizeof ( gdp - > bg_checksum ) ; /* skip checksum */
/* for checksum of struct ext4_group_desc do the rest...*/
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_64bit ( sb ) & &
2012-04-30 02:45:10 +04:00
offset < le16_to_cpu ( sbi - > s_es - > s_desc_size ) )
crc = crc16 ( crc , ( __u8 * ) gdp + offset ,
le16_to_cpu ( sbi - > s_es - > s_desc_size ) -
offset ) ;
out :
2007-10-17 02:38:25 +04:00
return cpu_to_le16 ( crc ) ;
}
2012-04-30 02:45:10 +04:00
int ext4_group_desc_csum_verify ( struct super_block * sb , __u32 block_group ,
2007-10-17 02:38:25 +04:00
struct ext4_group_desc * gdp )
{
2012-04-30 02:45:10 +04:00
if ( ext4_has_group_desc_csum ( sb ) & &
2015-10-17 23:18:43 +03:00
( gdp - > bg_checksum ! = ext4_group_desc_csum ( sb , block_group , gdp ) ) )
2007-10-17 02:38:25 +04:00
return 0 ;
return 1 ;
}
2012-04-30 02:45:10 +04:00
void ext4_group_desc_csum_set ( struct super_block * sb , __u32 block_group ,
struct ext4_group_desc * gdp )
{
if ( ! ext4_has_group_desc_csum ( sb ) )
return ;
2015-10-17 23:18:43 +03:00
gdp - > bg_checksum = ext4_group_desc_csum ( sb , block_group , gdp ) ;
2012-04-30 02:45:10 +04:00
}
2006-10-11 12:20:50 +04:00
/* Called at mount-time, super-block is locked */
ext4: add support for lazy inode table initialization
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out. The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, have not been corrupted. However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the old, uninitialized data could potentially cause e2fsck to
report false problems.
Hence, it is important for the inode tables to be initialized as soon
as possible. This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.
This is done via a new kernel thread called ext4lazyinit, which is
created on demand and destroyed when it is no longer needed. There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with the init_itable mount option is mounted, the ext4lazyinit
thread is created, then the filesystem can register its request in the
request list.
This thread then walks through the list of requests picking up
scheduled requests and invoking ext4_init_inode_table(). The next schedule
time for the request is computed by multiplying the time it took to
zero out the last inode table by the wait multiplier, which can be set with
the (init_itable=n) mount option (default is 10). We are doing
this so we do not take the whole I/O bandwidth. When the thread is no
longer necessary (request list is empty) it frees the appropriate
structures and exits (and can be created later by another
filesystem).
We do not disturb regular inode allocations in any way; the allocator does not
care whether the inode table is zeroed or not. But when zeroing, we
have to skip used inodes, obviously. Also we should prevent new inode
allocations from the group while zeroing is under way. For that we
take the write alloc_sem lock in ext4_init_inode_table() and the read alloc_sem
in ext4_claim_inode(), so when we are unlucky and the allocator hits the
group which is currently being zeroed, it just has to wait.
This can be suppressed using the mount option no_init_itable.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
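The throttling rule described in this commit message (next schedule time = time taken by the last zeroing pass multiplied by the wait multiplier) can be written down as a small sketch. The function names and the use of abstract "ticks" are assumptions for illustration only; they are not the kernel's identifiers or time units.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: how long to wait before the next zeroing request,
 * given how long the previous inode table took to zero and the wait
 * multiplier (the commit message gives 10 as the default).
 */
uint64_t demo_next_delay(uint64_t elapsed_ticks, unsigned int wait_mult)
{
	return elapsed_ticks * wait_mult;
}

/* Hypothetical helper: absolute time at which the request runs again. */
uint64_t demo_next_sched(uint64_t now_ticks, uint64_t elapsed_ticks,
			 unsigned int wait_mult)
{
	return now_ticks + demo_next_delay(elapsed_ticks, wait_mult);
}
```

With a multiplier of 10, a pass that took 5 ticks is followed by 50 ticks of idle time, so zeroing takes up roughly one eleventh of the elapsed time, which is how the scheme avoids consuming the whole I/O bandwidth.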
2010-10-28 05:30:05 +04:00
static int ext4_check_descriptors ( struct super_block * sb ,
2016-08-01 07:51:02 +03:00
ext4_fsblk_t sb_block ,
2010-10-28 05:30:05 +04:00
ext4_group_t * first_not_zeroed )
2006-10-11 12:20:50 +04:00
{
2006-10-11 12:20:53 +04:00
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
ext4_fsblk_t first_block = le32_to_cpu ( sbi - > s_es - > s_first_data_block ) ;
ext4_fsblk_t last_block ;
2018-07-09 02:35:02 +03:00
ext4_fsblk_t last_bg_block = sb_block + ext4_bg_num_gdb ( sb , 0 ) ;
2006-10-11 12:21:10 +04:00
ext4_fsblk_t block_bitmap ;
ext4_fsblk_t inode_bitmap ;
ext4_fsblk_t inode_table ;
2007-10-17 02:38:25 +04:00
int flexbg_flag = 0 ;
2010-10-28 05:30:05 +04:00
ext4_group_t i , grp = sbi - > s_groups_count ;
2006-10-11 12:20:50 +04:00
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_flex_bg ( sb ) )
2007-10-17 02:38:25 +04:00
flexbg_flag = 1 ;
2008-09-09 06:25:24 +04:00
ext4_debug ( " Checking group descriptors " ) ;
2006-10-11 12:20:50 +04:00
2008-02-06 12:40:16 +03:00
for ( i = 0 ; i < sbi - > s_groups_count ; i + + ) {
struct ext4_group_desc * gdp = ext4_get_group_desc ( sb , i , NULL ) ;
2007-10-17 02:38:25 +04:00
if ( i = = sbi - > s_groups_count - 1 | | flexbg_flag )
2006-10-11 12:21:10 +04:00
last_block = ext4_blocks_count ( sbi - > s_es ) - 1 ;
2006-10-11 12:20:50 +04:00
else
last_block = first_block +
2006-10-11 12:20:53 +04:00
( EXT4_BLOCKS_PER_GROUP ( sb ) - 1 ) ;
2006-10-11 12:20:50 +04:00
2010-10-28 05:30:05 +04:00
if ( ( grp = = sbi - > s_groups_count ) & &
! ( gdp - > bg_flags & cpu_to_le16 ( EXT4_BG_INODE_ZEROED ) ) )
grp = i ;
2006-10-11 12:21:15 +04:00
block_bitmap = ext4_block_bitmap ( sb , gdp ) ;
2016-08-01 07:51:02 +03:00
if ( block_bitmap = = sb_block ) {
ext4_msg ( sb , KERN_ERR , " ext4_check_descriptors: "
" Block bitmap for group %u overlaps "
" superblock " , i ) ;
2018-03-30 05:10:35 +03:00
if ( ! sb_rdonly ( sb ) )
return 0 ;
2016-08-01 07:51:02 +03:00
}
2018-06-14 06:08:26 +03:00
if ( block_bitmap > = sb_block + 1 & &
block_bitmap < = last_bg_block ) {
ext4_msg ( sb , KERN_ERR , " ext4_check_descriptors: "
" Block bitmap for group %u overlaps "
" block group descriptors " , i ) ;
if ( ! sb_rdonly ( sb ) )
return 0 ;
}
2008-07-27 00:15:44 +04:00
if ( block_bitmap < first_block | | block_bitmap > last_block ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " ext4_check_descriptors: "
2009-01-06 06:18:16 +03:00
" Block bitmap for group %u not in group "
2009-06-05 01:36:36 +04:00
" (block %llu)! " , i , block_bitmap ) ;
2006-10-11 12:20:50 +04:00
return 0 ;
}
2006-10-11 12:21:15 +04:00
inode_bitmap = ext4_inode_bitmap ( sb , gdp ) ;
2016-08-01 07:51:02 +03:00
if ( inode_bitmap = = sb_block ) {
ext4_msg ( sb , KERN_ERR , " ext4_check_descriptors: "
" Inode bitmap for group %u overlaps "
" superblock " , i ) ;
2018-03-30 05:10:35 +03:00
if ( ! sb_rdonly ( sb ) )
return 0 ;
2016-08-01 07:51:02 +03:00
}
2018-06-14 06:08:26 +03:00
if ( inode_bitmap > = sb_block + 1 & &
inode_bitmap < = last_bg_block ) {
ext4_msg ( sb , KERN_ERR , " ext4_check_descriptors: "
" Inode bitmap for group %u overlaps "
" block group descriptors " , i ) ;
if ( ! sb_rdonly ( sb ) )
return 0 ;
}
2008-07-27 00:15:44 +04:00
if ( inode_bitmap < first_block | | inode_bitmap > last_block ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " ext4_check_descriptors: "
2009-01-06 06:18:16 +03:00
" Inode bitmap for group %u not in group "
2009-06-05 01:36:36 +04:00
" (block %llu)! " , i , inode_bitmap ) ;
2006-10-11 12:20:50 +04:00
return 0 ;
}
2006-10-11 12:21:15 +04:00
inode_table = ext4_inode_table ( sb , gdp ) ;
2016-08-01 07:51:02 +03:00
if ( inode_table = = sb_block ) {
ext4_msg ( sb , KERN_ERR , " ext4_check_descriptors: "
" Inode table for group %u overlaps "
" superblock " , i ) ;
2018-03-30 05:10:35 +03:00
if ( ! sb_rdonly ( sb ) )
return 0 ;
2016-08-01 07:51:02 +03:00
}
2018-06-14 06:08:26 +03:00
if ( inode_table > = sb_block + 1 & &
inode_table < = last_bg_block ) {
ext4_msg ( sb , KERN_ERR , " ext4_check_descriptors: "
" Inode table for group %u overlaps "
" block group descriptors " , i ) ;
if ( ! sb_rdonly ( sb ) )
return 0 ;
}
2006-10-11 12:21:10 +04:00
if ( inode_table < first_block | |
2008-07-27 00:15:44 +04:00
inode_table + sbi - > s_itb_per_group - 1 > last_block ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " ext4_check_descriptors: "
2009-01-06 06:18:16 +03:00
" Inode table for group %u not in group "
2009-06-05 01:36:36 +04:00
" (block %llu)! " , i , inode_table ) ;
2006-10-11 12:20:50 +04:00
return 0 ;
}
2009-05-03 04:35:09 +04:00
ext4_lock_group ( sb , i ) ;
2012-04-30 02:45:10 +04:00
if ( ! ext4_group_desc_csum_verify ( sb , i , gdp ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " ext4_check_descriptors: "
" Checksum for group %u failed (%u!=%u) " ,
2015-10-17 23:18:43 +03:00
i , le16_to_cpu ( ext4_group_desc_csum ( sb , i ,
2009-06-05 01:36:36 +04:00
gdp ) ) , le16_to_cpu ( gdp - > bg_checksum ) ) ;
2017-07-17 10:45:34 +03:00
if ( ! sb_rdonly ( sb ) ) {
2009-05-03 04:35:09 +04:00
ext4_unlock_group ( sb , i ) ;
2008-07-26 22:34:21 +04:00
return 0 ;
2008-09-08 18:47:19 +04:00
}
Ext4: Uninitialized Block Groups
In pass1 of e2fsck, every inode table in the filesystem is scanned and checked,
regardless of whether it is in use. This is the most time-consuming part
of the filesystem check. The uninitialized block group feature can greatly
reduce e2fsck time by eliminating checking of uninitialized inodes.
With this feature, there is a high water mark of used inodes for each block
group. Block and inode bitmaps can be uninitialized on disk via a flag in the
group descriptor to avoid reading or scanning them at e2fsck time. A checksum
of each group descriptor is used to ensure that corruption in the group
descriptor's bit flags does not cause incorrect operation.
The feature is enabled through a mkfs option
mke2fs /dev/ -O uninit_groups
A patch adding support for uninitialized block groups to e2fsprogs tools has
been posted to the linux-ext4 mailing list.
The patches have been stress tested with fsstress and fsx. In performance
tests of e2fsck time, we have seen that e2fsck time on ext3 grows
linearly with the total number of inodes in the filesystem. In ext4 with the
uninitialized block groups feature, the e2fsck time is constant with respect
to the total inode count, depending solely on the number of used inodes.
Since typical ext4 filesystems only use 1-10% of their inodes, this feature can
greatly reduce e2fsck time for users, with performance improvements of 2-20
times depending on how full the filesystem is.
The attached graph shows the major improvements in e2fsck times in filesystems
with a large total inode count, but few inodes in use.
In each group descriptor if we have
EXT4_BG_INODE_UNINIT set in bg_flags:
Inode table is not initialized/used in this group. So we can skip
the consistency check during fsck.
EXT4_BG_BLOCK_UNINIT set in bg_flags:
No block in the group is used. So we can skip the block bitmap
verification for this group.
We also add two new fields to group descriptor as a part of
uninitialized group patch.
__le16 bg_itable_unused; /* Unused inodes count */
__le16 bg_checksum; /* crc16(sb_uuid+group+desc) */
bg_itable_unused:
If EXT4_BG_INODE_UNINIT is not set in bg_flags,
then bg_itable_unused gives the offset within
the inode table up to which inodes are used. fsck can
use this to skip the inodes that are marked unused.
bg_checksum:
Now that we depend on bg_flags and bg_itable_unused to determine
the block and inode usage, we need to make sure group descriptor
is not corrupt. We add checksum to group descriptor to
detect corruption. If the descriptor is found to be corrupt, we
mark all the blocks and inodes in the group used.
Signed-off-by: Avantika Mathur <mathur@us.ibm.com>
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2007-10-17 02:38:25 +04:00
}
2009-05-03 04:35:09 +04:00
ext4_unlock_group ( sb , i ) ;
2007-10-17 02:38:25 +04:00
if ( ! flexbg_flag )
first_block + = EXT4_BLOCKS_PER_GROUP ( sb ) ;
2006-10-11 12:20:50 +04:00
}
2010-10-28 05:30:05 +04:00
if ( NULL ! = first_not_zeroed )
* first_not_zeroed = grp ;
2006-10-11 12:20:50 +04:00
return 1 ;
}
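The loop above derives, for each group, the block range that its bitmaps and inode table must fall in, and rejects descriptors that point outside it. A minimal userspace sketch of that range computation follows. The names `struct fs_geom`, `group_block_range` and `block_in_group` are hypothetical, plain integers stand in for the kernel types, and the starting value of `first_block` is assumed to be the first data block, since its initialization is not shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified geometry; not the kernel's ext4_sb_info. */
struct fs_geom {
	uint64_t first_data_block;  /* assumed starting value of first_block */
	uint64_t blocks_per_group;
	uint64_t blocks_count;      /* total blocks in the filesystem */
	uint32_t groups_count;
	int flex_bg;                /* nonzero when flex_bg is enabled */
};

/* Compute the inclusive block range a group's metadata may occupy. */
static void group_block_range(const struct fs_geom *g, uint32_t group,
			      uint64_t *first, uint64_t *last)
{
	assert(group < g->groups_count);

	if (g->flex_bg) {
		/* With flex_bg the range is never narrowed per group. */
		*first = g->first_data_block;
		*last = g->blocks_count - 1;
		return;
	}
	*first = g->first_data_block + (uint64_t)group * g->blocks_per_group;
	if (group == g->groups_count - 1)
		*last = g->blocks_count - 1;  /* the last group may be short */
	else
		*last = *first + g->blocks_per_group - 1;
}

/* Mirror of the "not in group" test applied to each bitmap block. */
static int block_in_group(const struct fs_geom *g, uint32_t group,
			  uint64_t blk)
{
	uint64_t first, last;

	group_block_range(g, group, &first, &last);
	return blk >= first && blk <= last;
}
```

The inode table check differs only in that the whole run of table blocks, not just its first block, has to end at or before the last block of the range.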
2008-01-29 07:58:27 +03:00
/*
* Maximal extent format file size .
* Resulting logical blkno at s_maxbytes must fit in our on - disk
* extent format containers , within a sector_t , and within i_blocks
* in the vfs . ext4 inode has 48 bits of i_block in fsblock units ,
* so that won ' t be a limiting factor .
*
2011-06-06 08:05:17 +04:00
* However , there is another limiting factor . We do store extents in the form
* of starting block and length , hence the resulting length of the extent
* covering maximum file size must fit into on - disk format containers as
* well . Given that length is always 1 unit bigger than the max unit ( because
* we count 0 as well ) we have to lower the s_maxbytes by one fs block .
*
2008-01-29 07:58:27 +03:00
* Note , this does * not * consider any metadata overhead for vfs i_blocks .
*/
2008-10-17 06:50:48 +04:00
static loff_t ext4_max_size ( int blkbits , int has_huge_files )
2008-01-29 07:58:27 +03:00
{
loff_t res ;
loff_t upper_limit = MAX_LFS_FILESIZE ;
2019-04-05 19:08:59 +03:00
BUILD_BUG_ON ( sizeof ( blkcnt_t ) < sizeof ( u64 ) ) ;
if ( ! has_huge_files ) {
2008-01-29 07:58:27 +03:00
upper_limit = ( 1LL < < 32 ) - 1 ;
/* total blocks in file system block size */
upper_limit > > = ( blkbits - 9 ) ;
upper_limit < < = blkbits ;
}
2011-06-06 08:05:17 +04:00
/*
* 32 - bit extent - start container , ee_block . We lower the maxbytes
* by one fs block , so ee_len can cover the extent of maximum file
* size
*/
res = ( 1LL < < 32 ) - 1 ;
2008-01-29 07:58:27 +03:00
res < < = blkbits ;
/* Sanity check against vm- & vfs- imposed limits */
if ( res > upper_limit )
res = upper_limit ;
return res ;
}
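The arithmetic in ext4_max_size() can be tried outside the kernel. The sketch below repeats the same steps with fixed-width integers. `extent_max_size` and `SKETCH_MAX_LFS_FILESIZE` are hypothetical names, and the latter assumes a 64-bit build where the VFS limit is the largest signed 64-bit value.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for MAX_LFS_FILESIZE on a 64-bit kernel (assumption). */
#define SKETCH_MAX_LFS_FILESIZE INT64_MAX

/* Largest byte size of an extent-mapped file for a given block size. */
static int64_t extent_max_size(int blkbits, int has_huge_files)
{
	int64_t upper_limit = SKETCH_MAX_LFS_FILESIZE;
	int64_t res;

	assert(blkbits >= 10 && blkbits <= 16);

	if (!has_huge_files) {
		/* i_blocks holds at most 2^32 - 1 512-byte sectors. */
		upper_limit = (1LL << 32) - 1;
		upper_limit >>= (blkbits - 9);  /* sectors to fs blocks */
		upper_limit <<= blkbits;        /* fs blocks to bytes */
	}

	/* 32-bit logical block numbers, lowered by one block for ee_len. */
	res = ((1LL << 32) - 1) << blkbits;

	return res > upper_limit ? upper_limit : res;
}
```

With 4 KiB blocks (`blkbits` of 12) this gives 2^44 - 4096 bytes when huge files are allowed and 2^41 - 4096 bytes otherwise.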
2006-10-11 12:20:50 +04:00
/*
2008-01-29 07:58:27 +03:00
* Maximal bitmap file size . There is a direct , and { , double - , triple - } indirect
2008-01-29 07:58:26 +03:00
* block limit , and also a limit of ( 2 ^ 48 - 1 ) 512 - byte sectors in i_blocks .
* We need to be 1 filesystem block less than the 2 ^ 48 sector limit .
2006-10-11 12:20:50 +04:00
*/
2008-10-17 06:50:48 +04:00
static loff_t ext4_max_bitmap_size ( int bits , int has_huge_files )
2006-10-11 12:20:50 +04:00
{
2022-03-01 14:17:04 +03:00
loff_t upper_limit , res = EXT4_NDIR_BLOCKS ;
2008-01-29 07:58:26 +03:00
int meta_blocks ;
2022-03-01 14:17:04 +03:00
unsigned int ppb = 1 < < ( bits - 2 ) ;
2021-06-05 08:09:32 +03:00
/*
* This is calculated to be the largest file size for a dense , block
2009-06-04 01:59:28 +04:00
* mapped file such that the file ' s total number of 512 - byte sectors ,
* including data and all indirect blocks , does not exceed ( 2 ^ 48 - 1 ) .
*
* __u32 i_blocks_lo and __u16 i_blocks_high represent the total
* number of 512 - byte sectors of the file .
2008-01-29 07:58:26 +03:00
*/
2019-04-05 19:08:59 +03:00
if ( ! has_huge_files ) {
2008-01-29 07:58:26 +03:00
/*
2019-04-05 19:08:59 +03:00
* ! has_huge_files implies that the inode i_block field
* represents total file blocks in 2 ^ 32 512 - byte sectors = =
* size of vfs inode i_blocks * 8
2008-01-29 07:58:26 +03:00
*/
upper_limit = ( 1LL < < 32 ) - 1 ;
/* total blocks in file system block size */
upper_limit > > = ( bits - 9 ) ;
} else {
2008-01-29 07:58:27 +03:00
/*
* We use 48 bit ext4_inode i_blocks
* With EXT4_HUGE_FILE_FL set the i_blocks
* represent total number of blocks in
* file system block size
*/
2008-01-29 07:58:26 +03:00
upper_limit = ( 1LL < < 48 ) - 1 ;
}
2022-03-01 14:17:04 +03:00
/* Compute how many blocks we can address by block tree */
res + = ppb ;
res + = ppb * ppb ;
res + = ( ( loff_t ) ppb ) * ppb * ppb ;
/* Compute how many metadata blocks are needed */
meta_blocks = 1 ;
meta_blocks + = 1 + ppb ;
meta_blocks + = 1 + ppb + ppb * ppb ;
/* Does block tree limit file size? */
if ( res + meta_blocks < = upper_limit )
goto check_lfs ;
res = upper_limit ;
/* How many metadata blocks are needed for addressing upper_limit? */
upper_limit - = EXT4_NDIR_BLOCKS ;
2008-01-29 07:58:26 +03:00
/* indirect blocks */
meta_blocks = 1 ;
2022-03-01 14:17:04 +03:00
upper_limit - = ppb ;
2008-01-29 07:58:26 +03:00
/* double indirect blocks */
2022-03-01 14:17:04 +03:00
if ( upper_limit < ppb * ppb ) {
meta_blocks + = 1 + DIV_ROUND_UP_ULL ( upper_limit , ppb ) ;
res - = meta_blocks ;
goto check_lfs ;
}
meta_blocks + = 1 + ppb ;
upper_limit - = ppb * ppb ;
/* triple indirect blocks for the rest */
meta_blocks + = 1 + DIV_ROUND_UP_ULL ( upper_limit , ppb ) +
DIV_ROUND_UP_ULL ( upper_limit , ppb * ppb ) ;
res - = meta_blocks ;
check_lfs :
2006-10-11 12:20:50 +04:00
res < < = bits ;
2008-01-29 07:58:26 +03:00
if ( res > MAX_LFS_FILESIZE )
res = MAX_LFS_FILESIZE ;
2022-03-01 14:17:04 +03:00
return res ;
2006-10-11 12:20:50 +04:00
}
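The first half of ext4_max_bitmap_size() counts how many data blocks the direct, indirect, double-indirect and triple-indirect pointers can address, and how many indirect blocks a fully populated tree needs. A small userspace sketch of just those two sums is shown below. The function names and `SKETCH_NDIR_BLOCKS` are hypothetical, and the sketch assumes 4-byte block pointers as implied by `ppb = 1 << (bits - 2)`.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NDIR_BLOCKS 12  /* direct block slots, as EXT4_NDIR_BLOCKS */

/* Data blocks addressable through direct and indirect block pointers. */
static int64_t bitmap_addressable_blocks(int bits)
{
	int64_t ppb;

	assert(bits >= 10 && bits <= 16);
	ppb = 1LL << (bits - 2);  /* 4-byte pointers per block */

	return SKETCH_NDIR_BLOCKS + ppb + ppb * ppb + ppb * ppb * ppb;
}

/* Indirect (metadata) blocks needed to map a fully populated tree. */
static int64_t bitmap_full_tree_meta_blocks(int bits)
{
	int64_t ppb;

	assert(bits >= 10 && bits <= 16);
	ppb = 1LL << (bits - 2);

	/* one indirect, then 1 + ppb double, then 1 + ppb + ppb^2 triple */
	return 1 + (1 + ppb) + (1 + ppb + ppb * ppb);
}
```

For 4 KiB blocks the tree addresses 1074791436 data blocks, which is about 4 TiB once shifted by the block size. Because data plus metadata stays below the 48-bit i_blocks limit in that case, the block tree, not i_blocks, is the limiting factor.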
2006-10-11 12:20:53 +04:00
static ext4_fsblk_t descriptor_loc ( struct super_block * sb ,
2009-06-04 01:59:28 +04:00
ext4_fsblk_t logical_sb_block , int nr )
2006-10-11 12:20:50 +04:00
{
2006-10-11 12:20:53 +04:00
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
2008-01-29 07:58:27 +03:00
ext4_group_t bg , first_meta_bg ;
2006-10-11 12:20:50 +04:00
int has_super = 0 ;
first_meta_bg = le32_to_cpu ( sbi - > s_es - > s_first_meta_bg ) ;
2015-10-17 23:18:43 +03:00
if ( ! ext4_has_feature_meta_bg ( sb ) | | nr < first_meta_bg )
2006-10-11 12:21:20 +04:00
return logical_sb_block + nr + 1 ;
2006-10-11 12:20:50 +04:00
bg = sbi - > s_desc_per_block * nr ;
2006-10-11 12:20:53 +04:00
if ( ext4_bg_has_super ( sb , bg ) )
2006-10-11 12:20:50 +04:00
has_super = 1 ;
2009-06-04 01:59:28 +04:00
2014-05-12 18:06:27 +04:00
/*
* If we have a meta_bg fs with 1 k blocks , group 0 ' s GDT is at
* block 2 , not 1. If s_first_data_block = = 0 ( bigalloc is enabled
* on modern mke2fs or blksize > 1 k on older mke2fs ) then we must
* compensate .
*/
if ( sb - > s_blocksize = = 1024 & & nr = = 0 & &
2018-01-11 21:17:49 +03:00
le32_to_cpu ( sbi - > s_es - > s_first_data_block ) = = 0 )
2014-05-12 18:06:27 +04:00
has_super + + ;
2006-10-11 12:20:53 +04:00
return ( has_super + ext4_group_first_block_no ( sb , bg ) ) ;
2006-10-11 12:20:50 +04:00
}
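descriptor_loc() picks between two layouts: the classic one, where descriptor blocks directly follow the superblock, and the meta_bg one, where each descriptor block lives in the first group of the meta group it describes. The sketch below illustrates that choice in userspace. `struct desc_geom` and `sketch_descriptor_loc` are hypothetical, whether the group carries a superblock backup is passed in rather than computed, and the 1 KiB block compensation from the original is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical inputs; the kernel reads these from the superblock. */
struct desc_geom {
	uint64_t logical_sb_block;  /* block holding the primary superblock */
	uint64_t first_data_block;
	uint64_t blocks_per_group;
	uint32_t desc_per_block;    /* group descriptors per fs block */
	uint32_t first_meta_bg;
	int meta_bg;                /* nonzero when meta_bg is enabled */
};

/* Block number of descriptor block nr (1 KiB special case omitted). */
static uint64_t sketch_descriptor_loc(const struct desc_geom *g, uint32_t nr,
				      int group_has_super)
{
	uint64_t bg;

	assert(g->desc_per_block > 0);

	/* Classic layout: descriptor blocks directly follow the superblock. */
	if (!g->meta_bg || nr < g->first_meta_bg)
		return g->logical_sb_block + nr + 1;

	/* meta_bg layout: first group covered by this descriptor block. */
	bg = (uint64_t)g->desc_per_block * nr;

	/* Skip the superblock backup if that group carries one. */
	return (group_has_super ? 1 : 0) +
	       g->first_data_block + bg * g->blocks_per_group;
}
```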
2008-01-29 08:19:52 +03:00
/**
* ext4_get_stripe_size : Get the stripe size .
* @ sbi : In memory super block info
*
* If we have specified it via mount option , then
* use the mount option value . If the value specified at mount time is
* greater than the blocks per group use the super block value .
* If the super block value is greater than blocks per group return 0.
* Allocator needs it to be less than blocks per group .
*
*/
static unsigned long ext4_get_stripe_size ( struct ext4_sb_info * sbi )
{
unsigned long stride = le16_to_cpu ( sbi - > s_es - > s_raid_stride ) ;
unsigned long stripe_width =
le32_to_cpu ( sbi - > s_es - > s_raid_stripe_width ) ;
2011-07-18 05:18:51 +04:00
int ret ;
2008-01-29 08:19:52 +03:00
if ( sbi - > s_stripe & & sbi - > s_stripe < = sbi - > s_blocks_per_group )
2011-07-18 05:18:51 +04:00
ret = sbi - > s_stripe ;
2017-02-10 08:56:09 +03:00
else if ( stripe_width & & stripe_width < = sbi - > s_blocks_per_group )
2011-07-18 05:18:51 +04:00
ret = stripe_width ;
2017-02-10 08:56:09 +03:00
else if ( stride & & stride < = sbi - > s_blocks_per_group )
2011-07-18 05:18:51 +04:00
ret = stride ;
else
ret = 0 ;
2008-01-29 08:19:52 +03:00
2011-07-18 05:18:51 +04:00
/*
* If the stripe width is 1 , this makes no sense and
* we set it to 0 to turn off stripe handling code .
*/
if ( ret < = 1 )
ret = 0 ;
2008-01-29 08:19:52 +03:00
2011-07-18 05:18:51 +04:00
return ret ;
2008-01-29 08:19:52 +03:00
}
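The selection order in ext4_get_stripe_size() is: mount option first, then the superblock stripe width, then the superblock stride, each accepted only if it is nonzero and no larger than the blocks per group. A userspace sketch of that priority chain, with the hypothetical name `pick_stripe_size` and plain arguments in place of the `ext4_sb_info` fields:

```c
#include <assert.h>

/* Pick the stripe size; 0 turns stripe handling off. */
static unsigned long pick_stripe_size(unsigned long mount_stripe,
				      unsigned long sb_stripe_width,
				      unsigned long sb_stride,
				      unsigned long blocks_per_group)
{
	unsigned long ret;

	assert(blocks_per_group > 0);

	if (mount_stripe && mount_stripe <= blocks_per_group)
		ret = mount_stripe;
	else if (sb_stripe_width && sb_stripe_width <= blocks_per_group)
		ret = sb_stripe_width;
	else if (sb_stride && sb_stride <= blocks_per_group)
		ret = sb_stride;
	else
		ret = 0;

	/* A stripe of 1 block is meaningless, so treat it as disabled. */
	return ret <= 1 ? 0 : ret;
}
```

Note that an oversized mount option does not force a result of 0 on its own; the code falls through to the superblock values first.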
2006-10-11 12:20:50 +04:00
2009-08-18 08:20:23 +04:00
/*
* Check whether this filesystem can be mounted based on
* the features present and the RDONLY / RDWR mount requested .
* Returns 1 if this filesystem can be mounted as requested ,
* 0 if it cannot be .
*/
2021-08-16 12:57:05 +03:00
int ext4_feature_set_ok ( struct super_block * sb , int readonly )
2009-08-18 08:20:23 +04:00
{
2015-10-17 23:18:43 +03:00
if ( ext4_has_unknown_ext4_incompat_features ( sb ) ) {
2009-08-18 08:20:23 +04:00
ext4_msg ( sb , KERN_ERR ,
" Couldn't mount because of "
" unsupported optional features (%x) " ,
( le32_to_cpu ( EXT4_SB ( sb ) - > s_es - > s_feature_incompat ) &
~ EXT4_FEATURE_INCOMPAT_SUPP ) ) ;
return 0 ;
}
2022-01-18 09:56:14 +03:00
# if !IS_ENABLED(CONFIG_UNICODE)
2019-04-25 21:05:42 +03:00
if ( ext4_has_feature_casefold ( sb ) ) {
ext4_msg ( sb , KERN_ERR ,
" Filesystem with casefold feature cannot be "
" mounted without CONFIG_UNICODE " ) ;
return 0 ;
}
# endif
2009-08-18 08:20:23 +04:00
if ( readonly )
return 1 ;
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_readonly ( sb ) ) {
2015-02-13 06:31:21 +03:00
ext4_msg ( sb , KERN_INFO , " filesystem is read-only " ) ;
2017-11-28 00:05:09 +03:00
sb - > s_flags | = SB_RDONLY ;
2015-02-13 06:31:21 +03:00
return 1 ;
}
2009-08-18 08:20:23 +04:00
/* Check that feature set is OK for a read-write mount */
2015-10-17 23:18:43 +03:00
if ( ext4_has_unknown_ext4_ro_compat_features ( sb ) ) {
2009-08-18 08:20:23 +04:00
ext4_msg ( sb , KERN_ERR , " couldn't mount RDWR because of "
" unsupported optional features (%x) " ,
( le32_to_cpu ( EXT4_SB ( sb ) - > s_es - > s_feature_ro_compat ) &
~ EXT4_FEATURE_RO_COMPAT_SUPP ) ) ;
return 0 ;
}
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_bigalloc ( sb ) & & ! ext4_has_feature_extents ( sb ) ) {
2011-09-10 02:36:51 +04:00
ext4_msg ( sb , KERN_ERR ,
" Can't support bigalloc feature without "
" extents feature \n " ) ;
return 0 ;
}
ext4: make quota as first class supported feature
This patch adds support for quotas as a first class feature in ext4;
which is to say, the quota files are stored in hidden inodes as file
system metadata, instead of as separate files visible in the file system
directory hierarchy.
It is based on the proposal at:
https://ext4.wiki.kernel.org/index.php/Design_For_1st_Class_Quota_in_Ext4
This patch introduces a new feature - EXT4_FEATURE_RO_COMPAT_QUOTA
which, when turned on, enables quota accounting at mount time
itself. Also, the quota inodes are stored in two additional superblock
fields. Some changes introduced by this patch that should be pointed
out are:
1) Two new ext4-superblock fields - s_usr_quota_inum and
s_grp_quota_inum for storing the quota inodes in use.
2) Default quota inodes are: inode#3 for tracking userquota and inode#4
for tracking group quota. The superblock fields can be set to use
other inodes as well.
3) If the QUOTA feature and corresponding quota inodes are set in
superblock, the quota usage tracking is turned on at mount time. On
'quotaon' ioctl, the quota limits enforcement is turned
on. 'quotaoff' ioctl turns off only the limits enforcement in this
case.
4) When QUOTA feature is in use, the quota mount options 'quota',
'usrquota', 'grpquota' are ignored by the kernel.
5) mke2fs or tune2fs can be used to set the QUOTA feature and initialize
quota inodes. The default reserved inodes will not be visible to user
as regular files.
6) The quota-tools will need to be modified to support hidden quota
files on ext4. E2fsprogs will also include support for creating and
fixing quota files.
7) Support is only for the new V2 quota file format.
Tested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johann Lombardi <johann@whamcloud.com>
Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-07-23 04:21:31 +04:00
2020-02-21 13:08:35 +03:00
# if !IS_ENABLED(CONFIG_QUOTA) || !IS_ENABLED(CONFIG_QFMT_V2)
2020-02-15 02:11:19 +03:00
if ( ! readonly & & ( ext4_has_feature_quota ( sb ) | |
ext4_has_feature_project ( sb ) ) ) {
2012-07-23 04:21:31 +04:00
ext4_msg ( sb , KERN_ERR ,
2020-02-15 02:11:19 +03:00
" The kernel was not built with CONFIG_QUOTA and CONFIG_QFMT_V2 " ) ;
2016-01-09 00:01:22 +03:00
return 0 ;
}
2012-07-23 04:21:31 +04:00
# endif /* CONFIG_QUOTA */
2009-08-18 08:20:23 +04:00
return 1 ;
}
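The core test in ext4_feature_set_ok() is a mask operation: any on-disk feature bit outside the supported set blocks the mount, with incompat bits checked always and ro_compat bits checked only for read-write mounts. The sketch below shows that pattern in isolation. The bit values and names are illustrative only and are not the real EXT4_FEATURE_* constants; the casefold, bigalloc and quota checks are omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit values only; not the real EXT4_FEATURE_* constants. */
#define SKETCH_FEAT_A     0x0001u
#define SKETCH_FEAT_B     0x0002u
#define SKETCH_FEAT_C     0x0004u
#define SKETCH_SUPPORTED  (SKETCH_FEAT_A | SKETCH_FEAT_B)

/* Bits set on disk that the running code does not know about. */
static uint32_t unknown_features(uint32_t on_disk, uint32_t supported)
{
	return on_disk & ~supported;
}

/* Returns 1 if mountable as requested, 0 if not (same convention). */
static int sketch_feature_set_ok(uint32_t incompat, uint32_t ro_compat,
				 int readonly)
{
	assert(readonly == 0 || readonly == 1);

	if (unknown_features(incompat, SKETCH_SUPPORTED))
		return 0;  /* unknown incompat bits block every mount */
	if (readonly)
		return 1;  /* ro_compat bits do not matter when read-only */
	return unknown_features(ro_compat, SKETCH_SUPPORTED) == 0;
}
```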
2010-07-27 19:56:04 +04:00
/*
* This function is called once a day if we have errors logged
* on the file system
*/
2017-10-18 19:45:17 +03:00
static void print_daily_error_info ( struct timer_list * t )
2010-07-27 19:56:04 +04:00
{
2017-10-18 19:45:17 +03:00
struct ext4_sb_info * sbi = from_timer ( sbi , t , s_err_report ) ;
struct super_block * sb = sbi - > s_sb ;
struct ext4_super_block * es = sbi - > s_es ;
2010-07-27 19:56:04 +04:00
if ( es - > s_error_count )
2014-07-06 02:40:52 +04:00
/* fsck newer than v1.41.13 is needed to clean this condition. */
ext4_msg ( sb , KERN_NOTICE , " error count since last fsck: %u " ,
2010-07-27 19:56:04 +04:00
le32_to_cpu ( es - > s_error_count ) ) ;
if ( es - > s_first_error_time ) {
2018-07-29 22:51:48 +03:00
printk ( KERN_NOTICE " EXT4-fs (%s): initial error at time %llu: %.*s:%d " ,
sb - > s_id ,
ext4_get_tstamp ( es , s_first_error_time ) ,
2010-07-27 19:56:04 +04:00
( int ) sizeof ( es - > s_first_error_func ) ,
es - > s_first_error_func ,
le32_to_cpu ( es - > s_first_error_line ) ) ;
if ( es - > s_first_error_ino )
2016-10-13 06:12:53 +03:00
printk ( KERN_CONT " : inode %u " ,
2010-07-27 19:56:04 +04:00
le32_to_cpu ( es - > s_first_error_ino ) ) ;
if ( es - > s_first_error_block )
2016-10-13 06:12:53 +03:00
printk ( KERN_CONT " : block %llu " , ( unsigned long long )
2010-07-27 19:56:04 +04:00
le64_to_cpu ( es - > s_first_error_block ) ) ;
2016-10-13 06:12:53 +03:00
printk ( KERN_CONT " \n " ) ;
2010-07-27 19:56:04 +04:00
}
if ( es - > s_last_error_time ) {
2018-07-29 22:51:48 +03:00
printk ( KERN_NOTICE " EXT4-fs (%s): last error at time %llu: %.*s:%d " ,
sb - > s_id ,
ext4_get_tstamp ( es , s_last_error_time ) ,
2010-07-27 19:56:04 +04:00
( int ) sizeof ( es - > s_last_error_func ) ,
es - > s_last_error_func ,
le32_to_cpu ( es - > s_last_error_line ) ) ;
if ( es - > s_last_error_ino )
2016-10-13 06:12:53 +03:00
printk ( KERN_CONT " : inode %u " ,
2010-07-27 19:56:04 +04:00
le32_to_cpu ( es - > s_last_error_ino ) ) ;
if ( es - > s_last_error_block )
2016-10-13 06:12:53 +03:00
printk ( KERN_CONT " : block %llu " , ( unsigned long long )
2010-07-27 19:56:04 +04:00
le64_to_cpu ( es - > s_last_error_block ) ) ;
2016-10-13 06:12:53 +03:00
printk ( KERN_CONT " \n " ) ;
2010-07-27 19:56:04 +04:00
}
mod_timer ( & sbi - > s_err_report , jiffies + 24 * 60 * 60 * HZ ) ; /* Once a day */
}
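print_daily_error_info() prints the stored function names with `%.*s` and a precision of `sizeof` the field, which keeps the read bounded even when the fixed-width field fills all of its bytes and has no terminating NUL. A small userspace sketch of that idiom, using a hypothetical `struct err_record` and `format_error` in place of the superblock fields and printk:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record with a fixed-width, possibly unterminated name. */
struct err_record {
	char func[8];          /* may use all 8 bytes with no trailing NUL */
	unsigned int line;
};

/* Format "func:line" into out and return the snprintf() result. */
static int format_error(char *out, size_t outlen, const struct err_record *r)
{
	assert(outlen > 0);

	/* The precision caps the read at sizeof(r->func) bytes. */
	return snprintf(out, outlen, "%.*s:%u",
			(int)sizeof(r->func), r->func, r->line);
}
```

Without the precision, `%s` would keep reading past the end of the field until it happened to find a zero byte.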
ext4: add support for lazy inode table initialization
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out. The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, have not been corrupted. However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the old, uninitialized data could potentially cause e2fsck to
report false problems.
Hence, it is important for the inode tables to be initialized as soon
as possible. This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.
This is done via a new kernel thread called ext4lazyinit, which is
created on demand and destroyed when it is no longer needed. There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with the init_itable mount option is mounted, the
ext4lazyinit thread is created; the filesystem can then register its
request in the request list.
This thread then walks through the list of requests, picking up
scheduled requests and invoking ext4_init_inode_table(). The next schedule
time for a request is computed by multiplying the time it took to
zero out the last inode table by the wait multiplier, which can be set with
the (init_itable=n) mount option (default is 10). We do
this so that we do not take the whole I/O bandwidth. When the thread is no
longer necessary (the request list is empty), it frees the appropriate
structures and exits (and can be created later by another
filesystem).
We do not disturb regular inode allocations in any way; they just do not
care whether the inode table is zeroed or not. But when zeroing, we
have to skip used inodes, obviously. We should also prevent new inode
allocations from the group while zeroing is under way. For that we
take alloc_sem for writing in ext4_init_inode_table() and for reading
in ext4_claim_inode(), so when we are unlucky and the allocator hits the
group which is currently being zeroed, it just has to wait.
This can be suppressed using the mount option noinit_itable.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
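The back-off rule described in the commit message above (next delay = time the last zeroing pass took, scaled by the wait multiplier) can be sketched in userspace C. This is only an illustration under stated assumptions: `SKETCH_HZ`, `sketch_nsecs_to_ticks()` and `sketch_next_timeout()` are hypothetical names that do not exist in the kernel, and the tick rate of 250 is an arbitrary choice standing in for `HZ`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's HZ (timer ticks per second). */
#define SKETCH_HZ 250u

/* Convert nanoseconds to ticks, rounding down (a rough model of nsecs_to_jiffies()). */
static uint64_t sketch_nsecs_to_ticks(uint64_t nsecs)
{
	return nsecs / (1000000000ull / SKETCH_HZ);
}

/*
 * Delay before the next zeroing pass: the time the last pass took,
 * multiplied by the wait multiplier (init_itable=n, default 10).
 */
static uint64_t sketch_next_timeout(uint64_t elapsed_ns, unsigned int wait_mult)
{
	assert(wait_mult > 0);
	return sketch_nsecs_to_ticks(elapsed_ns * wait_mult);
}
```

With these assumed values, a pass that took 40 ms with the default multiplier of 10 yields a 400 ms pause, so zeroing occupies roughly one eleventh of the wall-clock time rather than saturating the device.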
2010-10-28 05:30:05 +04:00
/* Find next suitable group and run ext4_init_inode_table */
static int ext4_run_li_request(struct ext4_li_request *elr)
{
	struct ext4_group_desc *gdp = NULL;
2020-07-17 07:14:40 +03:00
	struct super_block *sb = elr->lr_super;
	ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
	ext4_group_t group = elr->lr_next_group;
	unsigned int prefetch_ios = 0;
2010-10-28 05:30:05 +04:00
	int ret = 0;
2021-09-02 19:44:12 +03:00
	u64 start_time;
2010-10-28 05:30:05 +04:00
2020-07-17 07:14:40 +03:00
	if (elr->lr_mode == EXT4_LI_MODE_PREFETCH_BBITMAP) {
		elr->lr_next_group = ext4_mb_prefetch(sb, group,
				EXT4_SB(sb)->s_mb_prefetch, &prefetch_ios);
		if (prefetch_ios)
			ext4_mb_prefetch_fini(sb, elr->lr_next_group,
					      prefetch_ios);
		trace_ext4_prefetch_bitmaps(sb, group, elr->lr_next_group,
					    prefetch_ios);
		if (group >= elr->lr_next_group) {
			ret = 1;
			if (elr->lr_first_not_zeroed != ngroups &&
			    !sb_rdonly(sb) && test_opt(sb, INIT_INODE_TABLE)) {
				elr->lr_next_group = elr->lr_first_not_zeroed;
				elr->lr_mode = EXT4_LI_MODE_ITABLE;
				ret = 0;
			}
		}
		return ret;
	}
2010-10-28 05:30:05 +04:00
2020-07-17 07:14:40 +03:00
	for (; group < ngroups; group++) {
2010-10-28 05:30:05 +04:00
		gdp = ext4_get_group_desc(sb, group, NULL);
		if (!gdp) {
			ret = 1;
			break;
		}
		if (!(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED)))
			break;
	}
2013-01-13 17:41:45 +04:00
	if (group >= ngroups)
2010-10-28 05:30:05 +04:00
		ret = 1;
	if (!ret) {
2021-09-02 19:44:12 +03:00
		start_time = ktime_get_real_ns();
2010-10-28 05:30:05 +04:00
		ret = ext4_init_inode_table(sb, group,
					    elr->lr_timeout ? 0 : 1);
2020-07-17 07:14:40 +03:00
		trace_ext4_lazy_itable_init(sb, group);
2010-10-28 05:30:05 +04:00
		if (elr->lr_timeout == 0) {
2021-09-02 19:44:12 +03:00
			elr->lr_timeout = nsecs_to_jiffies((ktime_get_real_ns() - start_time) *
				EXT4_SB(elr->lr_super)->s_li_wait_mult);
2010-10-28 05:30:05 +04:00
		}
		elr->lr_next_sched = jiffies + elr->lr_timeout;
		elr->lr_next_group = group + 1;
	}
	return ret;
}
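The group scan in ext4_run_li_request() above (advance until a group whose descriptor lacks the "inode table zeroed" flag, or until the groups run out) can be modelled in plain C. This is a minimal sketch, not the kernel code: `SKETCH_BG_INODE_ZEROED` and `sketch_next_unzeroed()` are hypothetical names, the flag array stands in for the on-disk group descriptors, and the little-endian conversion done by cpu_to_le16() is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit standing in for EXT4_BG_INODE_ZEROED. */
#define SKETCH_BG_INODE_ZEROED 0x0004u

/*
 * Return the index of the first group at or after 'start' whose inode
 * table is not yet zeroed, or 'ngroups' if every remaining group is done.
 */
static size_t sketch_next_unzeroed(const uint16_t *flags, size_t ngroups,
				   size_t start)
{
	size_t group;

	assert(flags != NULL);
	for (group = start; group < ngroups; group++) {
		if (!(flags[group] & SKETCH_BG_INODE_ZEROED))
			break;
	}
	return group;
}
```

A return value equal to `ngroups` corresponds to the `group >= ngroups` case in the function above, where `ret` is set to 1 and no further zeroing is attempted for that request.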
/*
 * Remove lr_request from the list_request and free the
2011-05-20 21:49:04 +04:00
 * request structure. Should be called with li_list_mtx held
2010-10-28 05:30:05 +04:00
*/
static void ext4_remove_li_request(struct ext4_li_request *elr)
{
	if (!elr)
		return;
	list_del(&elr->lr_request);
2020-07-17 07:14:40 +03:00
	EXT4_SB(elr->lr_super)->s_li_request = NULL;
2010-10-28 05:30:05 +04:00
	kfree(elr);
}
static void ext4_unregister_li_request(struct super_block *sb)
{
2011-05-20 21:55:29 +04:00
	mutex_lock(&ext4_li_mtx);
	if (!ext4_li_info) {
		mutex_unlock(&ext4_li_mtx);
2010-10-28 05:30:05 +04:00
		return;
2011-05-20 21:55:29 +04:00
	}
ext4: add support for lazy inode table initialization
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out. The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, have not been corrupted. However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the old, uninitialized data could potentially cause e2fsck to
report false problems.
Hence, it is important for the inode tables to be initialized as soon
as possible. This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.
This is done via a new kernel thread called ext4lazyinit, which is
created on demand and destroyed when it is no longer needed. There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with the init_itable mount option is mounted, the
ext4lazyinit thread is created, then the filesystem can register its
request in the request list.
This thread then walks through the list of requests, picking up
scheduled requests and invoking ext4_init_inode_table(). The next
schedule time for a request is computed by multiplying the time it took
to zero out the last inode table by the wait multiplier, which can be
set with the (init_itable=n) mount option (default is 10). We do
this so that we do not take the whole I/O bandwidth. When the thread is
no longer necessary (request list is empty), it frees the appropriate
structures and exits (and can be created later by another
filesystem).
We do not disturb regular inode allocations in any way; the allocator
simply does not care whether the inode table is zeroed or not. But when
zeroing, we have to skip used inodes, obviously. We should also prevent
new inode allocations from the group while zeroing is under way. For
that we take the alloc_sem write lock in ext4_init_inode_table() and the
alloc_sem read lock in ext4_claim_inode, so when we are unlucky and the
allocator hits the group which is currently being zeroed, it just has
to wait.
This can be suppressed using the mount option no_init_itable.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
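The throttling rule described in the commit message above (next schedule
time = time spent zeroing the last inode table, multiplied by the wait
multiplier) can be illustrated with a small userspace sketch. This is not
the kernel code: sketch_next_sched() and SKETCH_DEF_LI_WAIT_MULT are
hypothetical names chosen for illustration, and plain unsigned long tick
counts stand in for jiffies.

```c
#include <assert.h>

/* Hypothetical stand-in for the default wait multiplier (10 per the text). */
#define SKETCH_DEF_LI_WAIT_MULT 10UL

/*
 * Return the tick at which the next zeroing pass may run: the time the
 * last pass took, scaled by the wait multiplier, added to the current tick.
 * A slow pass therefore pushes the next pass further into the future.
 */
static unsigned long sketch_next_sched(unsigned long now,
				       unsigned long elapsed,
				       unsigned long wait_mult)
{
	assert(wait_mult > 0);	/* a zero multiplier would disable throttling */
	return now + elapsed * wait_mult;
}
```

With the default multiplier, a pass that took 5 ticks keeps the thread idle
for roughly 50 ticks, so zeroing uses only a fraction of the I/O bandwidth.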
2010-10-28 05:30:05 +04:00
	mutex_lock(&ext4_li_info->li_list_mtx);
2011-05-20 21:55:29 +04:00
	ext4_remove_li_request(EXT4_SB(sb)->s_li_request);
2010-10-28 05:30:05 +04:00
	mutex_unlock(&ext4_li_info->li_list_mtx);
2011-05-20 21:55:29 +04:00
	mutex_unlock(&ext4_li_mtx);
2010-10-28 05:30:05 +04:00
}
2011-02-03 22:33:15 +03:00
static struct task_struct *ext4_lazyinit_task;
2010-10-28 05:30:05 +04:00
/*
 * This is the function where the ext4lazyinit thread lives. It walks
 * through the request list searching for the next scheduled filesystem.
 * When such a fs is found, run the lazy initialization request
 * (ext4_run_li_request) and keep track of the time spent in this
 * function. Based on that time we compute the next schedule time of
 * the request. When walking through the list is complete, compute
 * the next waking time and put itself to sleep.
 */
static int ext4_lazyinit_thread(void *arg)
{
2022-04-01 11:13:21 +03:00
	struct ext4_lazy_init *eli = arg;
2010-10-28 05:30:05 +04:00
	struct list_head *pos, *n;
	struct ext4_li_request *elr;
2011-05-20 21:49:04 +04:00
	unsigned long next_wakeup, cur;
2010-10-28 05:30:05 +04:00
	BUG_ON(NULL == eli);
cont_thread:
	while (true) {
		next_wakeup = MAX_JIFFY_OFFSET;
		mutex_lock(&eli->li_list_mtx);
		if (list_empty(&eli->li_request_list)) {
			mutex_unlock(&eli->li_list_mtx);
			goto exit_thread;
		}
		list_for_each_safe(pos, n, &eli->li_request_list) {
2016-09-06 06:38:36 +03:00
			int err = 0;
			int progress = 0;
2010-10-28 05:30:05 +04:00
			elr = list_entry(pos, struct ext4_li_request,
					 lr_request);
2016-09-06 06:38:36 +03:00
			if (time_before(jiffies, elr->lr_next_sched)) {
				if (time_before(elr->lr_next_sched, next_wakeup))
					next_wakeup = elr->lr_next_sched;
				continue;
			}
			if (down_read_trylock(&elr->lr_super->s_umount)) {
				if (sb_start_write_trylock(elr->lr_super)) {
					progress = 1;
					/*
					 * We hold sb->s_umount, sb can not
					 * be removed from the list, it is
					 * now safe to drop li_list_mtx
					 */
					mutex_unlock(&eli->li_list_mtx);
					err = ext4_run_li_request(elr);
					sb_end_write(elr->lr_super);
					mutex_lock(&eli->li_list_mtx);
					n = pos->next;
2010-11-02 21:19:30 +03:00
				}
2016-09-06 06:38:36 +03:00
				up_read((&elr->lr_super->s_umount));
			}
			/* error, remove the lazy_init job */
			if (err) {
				ext4_remove_li_request(elr);
				continue;
			}
			if (!progress) {
				elr->lr_next_sched = jiffies +
					(prandom_u32()
					 % (EXT4_DEF_LI_MAX_START_DELAY * HZ));
2010-10-28 05:30:05 +04:00
			}
			if (time_before(elr->lr_next_sched, next_wakeup))
				next_wakeup = elr->lr_next_sched;
		}
		mutex_unlock(&eli->li_list_mtx);
2011-11-22 00:32:22 +04:00
		try_to_freeze();
2010-10-28 05:30:05 +04:00
2011-05-20 21:49:04 +04:00
		cur = jiffies;
		if ((time_after_eq(cur, next_wakeup)) ||
2010-11-02 21:07:17 +03:00
		    (MAX_JIFFY_OFFSET == next_wakeup)) {
2010-10-28 05:30:05 +04:00
			cond_resched();
			continue;
		}
2011-05-20 21:49:04 +04:00
		schedule_timeout_interruptible(next_wakeup - cur);
2011-02-03 22:33:15 +03:00
		if (kthread_should_stop()) {
			ext4_clear_request_list();
			goto exit_thread;
		}
2010-10-28 05:30:05 +04:00
	}
exit_thread:
	/*
	 * It looks like the request list is empty, but we need
	 * to check it under the li_list_mtx lock, to prevent any
	 * additions into it, and of course we should lock ext4_li_mtx
	 * to atomically free the list and ext4_li_info, because at
	 * this point another ext4 filesystem could be registering
	 * new one.
	 */
	mutex_lock(&ext4_li_mtx);
	mutex_lock(&eli->li_list_mtx);
	if (!list_empty(&eli->li_request_list)) {
		mutex_unlock(&eli->li_list_mtx);
		mutex_unlock(&ext4_li_mtx);
		goto cont_thread;
	}
	mutex_unlock(&eli->li_list_mtx);
	kfree(ext4_li_info);
	ext4_li_info = NULL;
	mutex_unlock(&ext4_li_mtx);
	return 0;
}
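The thread function above picks the earliest lr_next_sched among all
requests using time_before(), which stays correct even when the tick
counter wraps around. A minimal userspace sketch of that idea follows; it
is an illustration, not the kernel implementation. sketch_time_before() and
sketch_earliest() are hypothetical names, and the comparison relies on the
usual two's-complement conversion from unsigned long to long.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Wrap-safe "a is earlier than b" for free-running tick counters: the
 * unsigned difference is reinterpreted as signed, so a counter that has
 * wrapped past zero still compares as later.
 */
static int sketch_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/*
 * Return the earliest deadline in sched[0..n-1], or fallback if none of
 * them is earlier than fallback (mirroring how next_wakeup starts at a
 * maximum value and is lowered while walking the request list).
 */
static unsigned long sketch_earliest(const unsigned long *sched, size_t n,
				     unsigned long fallback)
{
	unsigned long next = fallback;
	size_t i;

	assert(sched != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (sketch_time_before(sched[i], next))
			next = sched[i];
	return next;
}
```

A plain `a < b` comparison would misorder deadlines that straddle the
wrap-around point, which is why the signed-difference form is used here.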
static void ext4_clear_request_list(void)
{
	struct list_head *pos, *n;
	struct ext4_li_request *elr;
	mutex_lock(&ext4_li_info->li_list_mtx);
	list_for_each_safe(pos, n, &ext4_li_info->li_request_list) {
		elr = list_entry(pos, struct ext4_li_request,
				 lr_request);
		ext4_remove_li_request(elr);
	}
	mutex_unlock(&ext4_li_info->li_list_mtx);
}
static int ext4_run_lazyinit_thread(void)
{
2011-02-03 22:33:15 +03:00
	ext4_lazyinit_task = kthread_run(ext4_lazyinit_thread,
					 ext4_li_info, "ext4lazyinit");
	if (IS_ERR(ext4_lazyinit_task)) {
		int err = PTR_ERR(ext4_lazyinit_task);
2010-10-28 05:30:05 +04:00
		ext4_clear_request_list();
		kfree(ext4_li_info);
		ext4_li_info = NULL;
2012-03-20 07:41:49 +04:00
		printk(KERN_CRIT "EXT4-fs: error %d creating inode table "
ext4: add support for lazy inode table initialization
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out. The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, have not been corrupted. However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the old, uninitialized data could potentially cause e2fsck to
report false problems.
Hence, it is important for the inode tables to be initialized as soon
as possible. This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.
This is done via a new kernel thread called ext4lazyinit, which is
created on demand and destroyed when it is no longer needed. There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with the init_itable mount option is mounted, the
ext4lazyinit thread is created; the filesystem can then register its
request in the request list.
This thread then walks through the list of requests, picking up
scheduled requests and invoking ext4_init_inode_table(). The next
schedule time for a request is computed by multiplying the time it took
to zero out the last inode table by the wait multiplier, which can be
set with the init_itable=n mount option (default is 10). We do this so
that we do not take the whole I/O bandwidth. When the thread is no
longer necessary (the request list is empty), it frees the appropriate
structures and exits (and can be created again later by another
filesystem).
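
The throttling rule in the paragraph above can be sketched in userspace
C roughly as follows. This is only an illustration under stated
assumptions: lazyinit_next_sched() is a hypothetical helper, not a
kernel function, and times are abstract ticks rather than jiffies.

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of ext4): given the current time, the
 * time the last inode table took to zero, and the wait multiplier,
 * return the time at which the request should run again.
 */
unsigned long lazyinit_next_sched(unsigned long now,
				  unsigned long zeroing_ticks,
				  unsigned long wait_mult)
{
	assert(wait_mult > 0);
	/* Back off in proportion to how long the last zeroing took. */
	return now + zeroing_ticks * wait_mult;
}
```

Under this model, with a multiplier of 10, a table that took 5 ticks to
zero delays the next run by 50 ticks, so a slower device is
automatically given longer pauses between zeroing passes.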
We do not disturb regular inode allocations in any way; the allocator
simply does not care whether the inode table is zeroed or not. But when
zeroing, we have to skip used inodes, obviously. We should also prevent
new inode allocations from the group while zeroing is under way. For
that we take alloc_sem for writing in ext4_init_inode_table() and for
reading in ext4_claim_inode(), so when we are unlucky and the allocator
hits the group which is currently being zeroed, it just has to wait.
This can be suppressed using the mount option noinit_itable.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28 05:30:05 +04:00
				 "initialization thread\n",
				 err);
		return err;
	}
	ext4_li_info->li_state |= EXT4_LAZYINIT_RUNNING;
	return 0;
}
/*
 * Check whether it makes sense to run the itable init thread or not.
 * If there is at least one uninitialized inode table, return the
 * corresponding group number; otherwise the loop goes through all
 * groups and returns the total number of groups.
 */
static ext4_group_t ext4_has_uninit_itable(struct super_block *sb)
{
	ext4_group_t group, ngroups = EXT4_SB(sb)->s_groups_count;
	struct ext4_group_desc *gdp = NULL;
2018-06-14 07:58:00 +03:00
	if (!ext4_has_group_desc_csum(sb))
		return ngroups;
2010-10-28 05:30:05 +04:00
	for (group = 0; group < ngroups; group++) {
		gdp = ext4_get_group_desc(sb, group, NULL);
		if (!gdp)
			continue;
2018-07-28 15:12:04 +03:00
		if (!(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED)))
2010-10-28 05:30:05 +04:00
			break;
	}
	return group;
}
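
The scan performed by ext4_has_uninit_itable() can be modelled outside
the kernel with a plain array of per-group flags. The sketch below is an
illustration only: first_unzeroed_group() and BG_INODE_ZEROED are
hypothetical names, the flag value is chosen for the example, and the
missing-descriptor and checksum checks of the real function are left
out.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative flag bit standing in for EXT4_BG_INODE_ZEROED. */
#define BG_INODE_ZEROED 0x0004

/*
 * Return the index of the first group whose inode table is not marked
 * as zeroed, or ngroups if every group is already zeroed.
 */
size_t first_unzeroed_group(const uint16_t *bg_flags, size_t ngroups)
{
	size_t group;

	for (group = 0; group < ngroups; group++) {
		if (!(bg_flags[group] & BG_INODE_ZEROED))
			break;
	}
	return group;
}
```

Returning ngroups as the "nothing to do" value lets the caller compare
the result against the group count instead of needing a separate error
code, which is how ext4_register_li_request() uses first_not_zeroed.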
static int ext4_li_info_new(void)
{
	struct ext4_lazy_init *eli = NULL;
	eli = kzalloc(sizeof(*eli), GFP_KERNEL);
	if (!eli)
		return -ENOMEM;
	INIT_LIST_HEAD(&eli->li_request_list);
	mutex_init(&eli->li_list_mtx);
	eli->li_state |= EXT4_LAZYINIT_QUIT;
	ext4_li_info = eli;
	return 0;
}
static struct ext4_li_request *ext4_li_request_new(struct super_block *sb,
						    ext4_group_t start)
{
	struct ext4_li_request *elr;
	elr = kzalloc(sizeof(*elr), GFP_KERNEL);
	if (!elr)
		return NULL;
	elr->lr_super = sb;
2020-07-17 07:14:40 +03:00
	elr->lr_first_not_zeroed = start;
2021-04-01 20:21:29 +03:00
	if (test_opt(sb, NO_PREFETCH_BLOCK_BITMAPS)) {
2020-07-17 07:14:40 +03:00
		elr->lr_mode = EXT4_LI_MODE_ITABLE;
		elr->lr_next_group = start;
2021-04-01 20:21:29 +03:00
	} else {
		elr->lr_mode = EXT4_LI_MODE_PREFETCH_BBITMAP;
2020-07-17 07:14:40 +03:00
	}
2010-10-28 05:30:05 +04:00
	/*
	 * Randomize first schedule time of the request to
	 * spread the inode table initialization requests
	 * better.
	 */
2013-11-08 09:14:53 +04:00
	elr->lr_next_sched = jiffies + (prandom_u32() %
				(EXT4_DEF_LI_MAX_START_DELAY * HZ));
2010-10-28 05:30:05 +04:00
	return elr;
}
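
The randomized first schedule time set at the end of
ext4_li_request_new() can be approximated in userspace as below. This is
a sketch under assumptions: first_sched_time() is a hypothetical helper,
rand() stands in for the kernel's prandom_u32(), and the delay window
and tick rate are passed in as parameters instead of using
EXT4_DEF_LI_MAX_START_DELAY and HZ.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical helper: pick a start time in the half-open interval
 * [now, now + max_delay_secs * hz) so that requests registered at
 * about the same moment do not all fire together.
 */
unsigned long first_sched_time(unsigned long now,
			       unsigned long max_delay_secs,
			       unsigned long hz)
{
	unsigned long window = max_delay_secs * hz;

	assert(window > 0);
	return now + (unsigned long)rand() % window;
}
```

The modulo keeps the offset strictly below the window size, so the first
run is never delayed by the full maximum.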
2013-01-13 17:41:45 +04:00
int ext4_register_li_request(struct super_block *sb,
			     ext4_group_t first_not_zeroed)
2010-10-28 05:30:05 +04:00
{
	struct ext4_sb_info *sbi = EXT4_SB(sb);
2013-01-13 17:41:45 +04:00
	struct ext4_li_request *elr = NULL;
2018-01-11 21:17:49 +03:00
	ext4_group_t ngroups = sbi->s_groups_count;
2011-01-10 20:30:17 +03:00
	int ret = 0;
2010-10-28 05:30:05 +04:00
2013-01-13 17:41:45 +04:00
	mutex_lock(&ext4_li_mtx);
2011-05-20 21:55:16 +04:00
	if (sbi->s_li_request != NULL) {
		/*
		 * Reset timeout so it can be computed again, because
		 * s_li_wait_mult might have changed.
		 */
		sbi->s_li_request->lr_timeout = 0;
2013-01-13 17:41:45 +04:00
		goto out;
2011-05-20 21:55:16 +04:00
	}
2010-10-28 05:30:05 +04:00
2021-04-01 20:21:29 +03:00
	if (test_opt(sb, NO_PREFETCH_BLOCK_BITMAPS) &&
2020-07-17 07:14:40 +03:00
	    (first_not_zeroed == ngroups || sb_rdonly(sb) ||
	     !test_opt(sb, INIT_INODE_TABLE)))
2013-01-13 17:41:45 +04:00
		goto out;
2010-10-28 05:30:05 +04:00
	elr = ext4_li_request_new(sb, first_not_zeroed);
2013-01-13 17:41:45 +04:00
	if (!elr) {
		ret = -ENOMEM;
		goto out;
	}
2010-10-28 05:30:05 +04:00
	if (NULL == ext4_li_info) {
		ret = ext4_li_info_new();
		if (ret)
			goto out;
	}
	mutex_lock(&ext4_li_info->li_list_mtx);
	list_add(&elr->lr_request, &ext4_li_info->li_request_list);
	mutex_unlock(&ext4_li_info->li_list_mtx);
	sbi->s_li_request = elr;
2011-04-05 00:00:49 +04:00
	/*
	 * set elr to NULL here since it has been inserted to
	 * the request_list and the removal and free of it is
	 * handled by ext4_clear_request_list from now on.
	 */
	elr = NULL;
2010-10-28 05:30:05 +04:00
	if (!(ext4_li_info->li_state & EXT4_LAZYINIT_RUNNING)) {
		ret = ext4_run_lazyinit_thread();
		if (ret)
			goto out;
	}
out:
2010-10-28 06:08:42 +04:00
	mutex_unlock(&ext4_li_mtx);
	if (ret)
2010-10-28 05:30:05 +04:00
kfree ( elr ) ;
return ret ;
}
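The commit message above says that the next run of a lazy-init request is scheduled by multiplying the time the last inode-table zeroing took by a wait multiplier (default 10). A minimal userspace sketch of that arithmetic follows, assuming millisecond units. The names `li_next_delay_ms` and `LI_DEFAULT_WAIT_MULT` are hypothetical and chosen for illustration; they are not the kernel's own identifiers.

```c
#include <stdint.h>

/* Hypothetical name; the commit message gives 10 as the init_itable=n default. */
#define LI_DEFAULT_WAIT_MULT 10u

/*
 * Delay before the next inode table is zeroed. The longer the last
 * zeroing took (that is, the busier the device), the longer the back-off,
 * so the thread does not consume the whole I/O bandwidth.
 */
uint64_t li_next_delay_ms(uint64_t last_zeroout_ms, unsigned int wait_mult)
{
	return last_zeroout_ms * wait_mult;
}
```

With the default multiplier, a zeroing pass that took 25 ms would be followed by a 250 ms pause before the next group is touched.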
/*
* We do not need to lock anything since this is called on
* module unload .
*/
static void ext4_destroy_lazyinit_thread ( void )
{
/*
* If thread exited earlier
* there ' s nothing to be done .
*/
2011-02-03 22:33:15 +03:00
if ( ! ext4_li_info | | ! ext4_lazyinit_task )
2010-10-28 05:30:05 +04:00
return ;
2011-02-03 22:33:15 +03:00
kthread_stop ( ext4_lazyinit_task ) ;
2010-10-28 05:30:05 +04:00
}
2012-05-27 15:48:56 +04:00
static int set_journal_csum_feature_set ( struct super_block * sb )
{
int ret = 1 ;
int compat , incompat ;
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
2014-10-13 11:36:16 +04:00
if ( ext4_has_metadata_csum ( sb ) ) {
2014-08-28 02:40:07 +04:00
/* journal checksum v3 */
2012-05-27 15:48:56 +04:00
compat = 0 ;
2014-08-28 02:40:07 +04:00
incompat = JBD2_FEATURE_INCOMPAT_CSUM_V3 ;
2012-05-27 15:48:56 +04:00
} else {
/* journal checksum v1 */
compat = JBD2_FEATURE_COMPAT_CHECKSUM ;
incompat = 0 ;
}
2014-09-11 19:38:21 +04:00
jbd2_journal_clear_features ( sbi - > s_journal ,
JBD2_FEATURE_COMPAT_CHECKSUM , 0 ,
JBD2_FEATURE_INCOMPAT_CSUM_V3 |
JBD2_FEATURE_INCOMPAT_CSUM_V2 ) ;
2012-05-27 15:48:56 +04:00
if ( test_opt ( sb , JOURNAL_ASYNC_COMMIT ) ) {
ret = jbd2_journal_set_features ( sbi - > s_journal ,
compat , 0 ,
JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT |
incompat ) ;
} else if ( test_opt ( sb , JOURNAL_CHECKSUM ) ) {
ret = jbd2_journal_set_features ( sbi - > s_journal ,
compat , 0 ,
incompat ) ;
jbd2_journal_clear_features ( sbi - > s_journal , 0 , 0 ,
JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT ) ;
} else {
2014-09-11 19:38:21 +04:00
jbd2_journal_clear_features ( sbi - > s_journal , 0 , 0 ,
JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT ) ;
2012-05-27 15:48:56 +04:00
}
return ret ;
}
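The branch structure of set_journal_csum_feature_set() can be easier to follow as a pure function over two bit masks. The sketch below is a userspace model, not the jbd2 API: `struct jfeatures`, `pick_journal_csum` and the `F_*` constants are hypothetical stand-ins, their numeric values are assumptions, and the error return of jbd2_journal_set_features() is deliberately not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for the JBD2 feature bits; the values are assumptions. */
#define F_COMPAT_CHECKSUM       0x01u
#define F_INCOMPAT_ASYNC_COMMIT 0x04u
#define F_INCOMPAT_CSUM_V2      0x08u
#define F_INCOMPAT_CSUM_V3      0x10u

struct jfeatures {
	uint32_t compat;
	uint32_t incompat;
};

struct jfeatures pick_journal_csum(struct jfeatures cur, bool metadata_csum,
				   bool async_commit, bool journal_checksum)
{
	/* metadata_csum selects checksum v3 (incompat); otherwise v1 (compat). */
	uint32_t compat = metadata_csum ? 0 : F_COMPAT_CHECKSUM;
	uint32_t incompat = metadata_csum ? F_INCOMPAT_CSUM_V3 : 0;

	/* Always start by dropping every checksum flavour. */
	cur.compat &= ~F_COMPAT_CHECKSUM;
	cur.incompat &= ~(F_INCOMPAT_CSUM_V3 | F_INCOMPAT_CSUM_V2);

	if (async_commit) {
		/* Async commit implies a checksum, so both are set together. */
		cur.compat |= compat;
		cur.incompat |= incompat | F_INCOMPAT_ASYNC_COMMIT;
	} else if (journal_checksum) {
		cur.compat |= compat;
		cur.incompat |= incompat;
		cur.incompat &= ~F_INCOMPAT_ASYNC_COMMIT;
	} else {
		/* Neither option: only make sure async commit is off. */
		cur.incompat &= ~F_INCOMPAT_ASYNC_COMMIT;
	}
	return cur;
}
```

Modelling the masks as a value that is passed in and returned keeps the sketch free of side effects, which makes each mount-option combination easy to check in isolation.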
2012-07-10 00:27:05 +04:00
/*
* Note : calculating the overhead so we can be compatible with
* historical BSD practice is quite difficult in the face of
* clusters / bigalloc . This is because multiple metadata blocks from
* different block group can end up in the same allocation cluster .
* Calculating the exact overhead in the face of clustered allocation
* requires either O ( all block bitmaps ) in memory or O ( number of block
* groups * * 2 ) in time . We will still calculate the superblock for
* older file systems - - - and if we come across with a bigalloc file
* system with zero in s_overhead_clusters the estimate will be close to
* correct especially for very large cluster sizes - - - but for newer
* file systems , it ' s better to calculate this figure once at mkfs
* time , and store it in the superblock . If the superblock value is
* present ( even for non - bigalloc file systems ) , we will use it .
*/
static int count_overhead ( struct super_block * sb , ext4_group_t grp ,
char * buf )
{
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
struct ext4_group_desc * gdp ;
ext4_fsblk_t first_block , last_block , b ;
ext4_group_t i , ngroups = ext4_get_groups_count ( sb ) ;
int s , j , count = 0 ;
2022-04-15 04:31:27 +03:00
int has_super = ext4_bg_has_super ( sb , grp ) ;
2012-07-10 00:27:05 +04:00
2015-10-17 23:18:43 +03:00
if ( ! ext4_has_feature_bigalloc ( sb ) )
2022-04-15 04:31:27 +03:00
return ( has_super + ext4_bg_num_gdb ( sb , grp ) +
( has_super ? le16_to_cpu ( sbi - > s_es - > s_reserved_gdt_blocks ) : 0 ) +
2012-08-16 19:59:04 +04:00
sbi - > s_itb_per_group + 2 ) ;
2012-07-10 00:27:05 +04:00
first_block = le32_to_cpu ( sbi - > s_es - > s_first_data_block ) +
( grp * EXT4_BLOCKS_PER_GROUP ( sb ) ) ;
last_block = first_block + EXT4_BLOCKS_PER_GROUP ( sb ) - 1 ;
for ( i = 0 ; i < ngroups ; i + + ) {
gdp = ext4_get_group_desc ( sb , i , NULL ) ;
b = ext4_block_bitmap ( sb , gdp ) ;
if ( b > = first_block & & b < = last_block ) {
ext4_set_bit ( EXT4_B2C ( sbi , b - first_block ) , buf ) ;
count + + ;
}
b = ext4_inode_bitmap ( sb , gdp ) ;
if ( b > = first_block & & b < = last_block ) {
ext4_set_bit ( EXT4_B2C ( sbi , b - first_block ) , buf ) ;
count + + ;
}
b = ext4_inode_table ( sb , gdp ) ;
if ( b > = first_block & & b + sbi - > s_itb_per_group < = last_block )
for ( j = 0 ; j < sbi - > s_itb_per_group ; j + + , b + + ) {
int c = EXT4_B2C ( sbi , b - first_block ) ;
ext4_set_bit ( c , buf ) ;
count + + ;
}
if ( i ! = grp )
continue ;
s = 0 ;
if ( ext4_bg_has_super ( sb , grp ) ) {
ext4_set_bit ( s + + , buf ) ;
count + + ;
}
2016-11-18 21:37:47 +03:00
j = ext4_bg_num_gdb ( sb , grp ) ;
if ( s + j > EXT4_BLOCKS_PER_GROUP ( sb ) ) {
ext4_error ( sb , " Invalid number of block group "
" descriptor blocks: %d " , j ) ;
j = EXT4_BLOCKS_PER_GROUP ( sb ) - s ;
2012-07-10 00:27:05 +04:00
}
2016-11-18 21:37:47 +03:00
count + = j ;
for ( ; j > 0 ; j - - )
ext4_set_bit ( EXT4_B2C ( sbi , s + + ) , buf ) ;
2012-07-10 00:27:05 +04:00
}
if ( ! count )
return 0 ;
return EXT4_CLUSTERS_PER_GROUP ( sb ) -
ext4_count_free ( buf , EXT4_CLUSTERS_PER_GROUP ( sb ) / 8 ) ;
}
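For a filesystem without bigalloc, count_overhead() returns early with a closed-form sum. The sketch below restates that sum as a standalone function so the terms can be checked by hand. `group_overhead_blocks` and its parameter names are hypothetical, and reading the trailing `+ 2` as the block bitmap plus the inode bitmap is an interpretation rather than something the source states.

```c
/*
 * Metadata blocks in one block group of a non-bigalloc filesystem:
 * optional superblock copy, group descriptor blocks, reserved GDT
 * blocks (counted only where a superblock copy lives), the inode
 * table, and two more blocks (assumed: block bitmap + inode bitmap).
 */
unsigned int group_overhead_blocks(int has_super, unsigned int num_gdb,
				   unsigned int reserved_gdt_blocks,
				   unsigned int itb_per_group)
{
	return (has_super ? 1u : 0u) + num_gdb +
	       (has_super ? reserved_gdt_blocks : 0u) +
	       itb_per_group + 2u;
}
```

For example, a group holding a superblock copy, 1 descriptor block, 256 reserved GDT blocks and a 512-block inode table would account for 1 + 1 + 256 + 512 + 2 = 772 blocks.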
/*
* Compute the overhead and stash it in sbi - > s_overhead
*/
int ext4_calculate_overhead ( struct super_block * sb )
{
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
struct ext4_super_block * es = sbi - > s_es ;
2016-09-30 09:08:49 +03:00
struct inode * j_inode ;
unsigned int j_blocks , j_inum = le32_to_cpu ( es - > s_journal_inum ) ;
2012-07-10 00:27:05 +04:00
ext4_group_t i , ngroups = ext4_get_groups_count ( sb ) ;
ext4_fsblk_t overhead = 0 ;
2014-11-25 21:08:04 +03:00
char * buf = ( char * ) get_zeroed_page ( GFP_NOFS ) ;
2012-07-10 00:27:05 +04:00
if ( ! buf )
return - ENOMEM ;
/*
* Compute the overhead ( FS structures ) . This is constant
* for a given filesystem unless the number of block groups
* changes so we cache the previous value until it does .
*/
/*
* All of the blocks before first_data_block are overhead
*/
overhead = EXT4_B2C ( sbi , le32_to_cpu ( es - > s_first_data_block ) ) ;
/*
* Add the overhead found in each block group
*/
for ( i = 0 ; i < ngroups ; i + + ) {
int blks ;
blks = count_overhead ( sb , i , buf ) ;
overhead + = blks ;
if ( blks )
memset ( buf , 0 , PAGE_SIZE ) ;
cond_resched ( ) ;
}
2016-09-30 09:08:49 +03:00
/*
* Add the internal journal blocks whether the journal has been
* loaded or not
*/
2020-09-24 06:03:42 +03:00
if ( sbi - > s_journal & & ! sbi - > s_journal_bdev )
2020-11-06 06:58:54 +03:00
overhead + = EXT4_NUM_B2C ( sbi , sbi - > s_journal - > j_total_len ) ;
2020-03-16 12:30:38 +03:00
else if ( ext4_has_feature_journal ( sb ) & & ! sbi - > s_journal & & j_inum ) {
/* j_inum for internal journal is non-zero */
2016-09-30 09:08:49 +03:00
j_inode = ext4_get_journal_inode ( sb , j_inum ) ;
if ( j_inode ) {
j_blocks = j_inode - > i_size > > sb - > s_blocksize_bits ;
overhead + = EXT4_NUM_B2C ( sbi , j_blocks ) ;
iput ( j_inode ) ;
} else {
ext4_msg ( sb , KERN_ERR , " can't get journal size " ) ;
}
}
2012-07-10 00:27:05 +04:00
sbi - > s_overhead = overhead ;
smp_wmb ( ) ;
free_page ( ( unsigned long ) buf ) ;
return 0 ;
}
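ext4_calculate_overhead() converts block counts to clusters in two ways: EXT4_B2C for the blocks before s_first_data_block and EXT4_NUM_B2C for the journal length. A minimal sketch of the arithmetic follows, assuming the first rounds down and the second rounds up to whole clusters. The helper names `b2c`, `num_b2c` and `journal_blocks` are hypothetical and do not come from the kernel headers.

```c
#include <stdint.h>

/* Cluster index containing a block: rounds down (assumed shape of EXT4_B2C). */
uint64_t b2c(uint64_t blocks, unsigned int cluster_bits)
{
	return blocks >> cluster_bits;
}

/* Clusters needed to hold `blocks` blocks: rounds up (assumed shape of EXT4_NUM_B2C). */
uint64_t num_b2c(uint64_t blocks, unsigned int cluster_bits)
{
	uint64_t ratio = (uint64_t)1 << cluster_bits; /* blocks per cluster */

	return (blocks + ratio - 1) >> cluster_bits;
}

/* Journal length in blocks, derived from the journal inode size in bytes. */
uint64_t journal_blocks(uint64_t i_size, unsigned int blocksize_bits)
{
	return i_size >> blocksize_bits;
}
```

A 128 MiB internal journal on a 4 KiB-block filesystem is 32768 blocks; with 16 blocks per cluster that adds 2048 clusters of overhead.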
2015-09-23 19:44:17 +03:00
static void ext4_set_resv_clusters ( struct super_block * sb )
2013-04-10 06:11:22 +04:00
{
ext4_fsblk_t resv_clusters ;
2015-09-23 19:44:17 +03:00
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
2013-04-10 06:11:22 +04:00
2013-12-09 06:11:59 +04:00
/*
* There ' s no need to reserve anything when we aren ' t using extents .
* The space estimates are exact , there are no unwritten extents ,
* hole punching doesn ' t need new metadata . . . This is needed especially
* to keep ext2 / 3 backward compatibility .
*/
2015-10-17 23:18:43 +03:00
if ( ! ext4_has_feature_extents ( sb ) )
2015-09-23 19:44:17 +03:00
return ;
2013-04-10 06:11:22 +04:00
/*
* By default we reserve 2 % or 4096 clusters , whichever is smaller .
* This should cover the situations where we can not afford to run
* out of space like for example punch hole , or converting
2014-04-21 07:45:47 +04:00
* unwritten extents in delalloc path . In most cases such
2013-04-10 06:11:22 +04:00
* allocation would require 1 , or 2 blocks , higher numbers are
* very rare .
*/
2015-09-23 19:44:17 +03:00
resv_clusters = ( ext4_blocks_count ( sbi - > s_es ) > >
sbi - > s_cluster_bits ) ;
2013-04-10 06:11:22 +04:00
do_div ( resv_clusters , 50 ) ;
resv_clusters = min_t ( ext4_fsblk_t , resv_clusters , 4096 ) ;
2015-09-23 19:44:17 +03:00
atomic64_set ( & sbi - > s_resv_clusters , resv_clusters ) ;
2013-04-10 06:11:22 +04:00
}
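The reservation rule in ext4_set_resv_clusters() is "2 % of the clusters, capped at 4096". A small userspace sketch of that computation follows. The function name `resv_clusters_for` is hypothetical, and plain 64-bit division stands in for the kernel's do_div().

```c
#include <stdint.h>

/* Reserved clusters: 2% of the filesystem (total / 50), but never more than 4096. */
uint64_t resv_clusters_for(uint64_t blocks_count, unsigned int cluster_bits)
{
	uint64_t clusters = blocks_count >> cluster_bits;
	uint64_t resv = clusters / 50;

	return resv < 4096 ? resv : 4096;
}
```

The cap means that any filesystem larger than 204800 clusters reserves exactly 4096 of them, so the reservation only scales with size on small filesystems.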
2020-10-22 06:21:00 +03:00
static const char * ext4_quota_mode ( struct super_block * sb )
{
# ifdef CONFIG_QUOTA
if ( ! ext4_quota_capable ( sb ) )
return " none " ;
if ( EXT4_SB ( sb ) - > s_journal & & ext4_is_quota_journalled ( sb ) )
return " journalled " ;
else
return " writeback " ;
# else
return " disabled " ;
# endif
}
2021-08-16 12:57:04 +03:00
static void ext4_setup_csum_trigger ( struct super_block * sb ,
enum ext4_journal_trigger_type type ,
void ( * trigger ) (
struct jbd2_buffer_trigger_type * type ,
struct buffer_head * bh ,
void * mapped_data ,
size_t size ) )
{
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
sbi - > s_journal_triggers [ type ] . sb = sb ;
sbi - > s_journal_triggers [ type ] . tr_triggers . t_frozen = trigger ;
}
2021-10-27 17:18:53 +03:00
static void ext4_free_sbi ( struct ext4_sb_info * sbi )
{
if ( ! sbi )
return ;
kfree ( sbi - > s_blockgroup_lock ) ;
dax: introduce holder for dax_device
Patch series "v14 fsdax-rmap + v11 fsdax-reflink", v2.
The patchset fsdax-rmap is aimed to support shared pages tracking for
fsdax.
It moves owner tracking from dax_associate_entry() to the pmem device driver,
by introducing an interface ->memory_failure() for struct pagemap. This
interface is called by memory_failure() in mm, and implemented by pmem
device.
Then holder operations are called to find the filesystem in which the
corrupted data is located, and the filesystem handler is called to track the
files or metadata associated with this page.
Finally, we can try to fix the corrupted data in the filesystem and do
other necessary processing, such as killing the processes that are using the
affected files.
The call trace is like this:
memory_failure()
|* fsdax case
|------------
|pgmap->ops->memory_failure() => pmem_pgmap_memory_failure()
| dax_holder_notify_failure() =>
| dax_device->holder_ops->notify_failure() =>
| - xfs_dax_notify_failure()
| |* xfs_dax_notify_failure()
| |--------------------------
| | xfs_rmap_query_range()
| | xfs_dax_failure_fn()
| | * corrupted on metadata
| | try to recover data, call xfs_force_shutdown()
| | * corrupted on file data
| | try to recover data, call mf_dax_kill_procs()
|* normal case
|-------------
|mf_generic_kill_procs()
The patchset fsdax-reflink attempts to add CoW support for fsdax, and
takes XFS, which has both reflink and fsdax features, as an example.
One of the key mechanisms that needs to be implemented in fsdax is CoW: copy
the data from srcmap before we actually write data to the destination
iomap, copying only the range in which data won't be changed.
Another mechanism is range comparison. In the page cache case, readpage() is
used to load data from disk into the page cache so that it can be
compared. In the fsdax case, readpage() does not work, so we need another
way to compare data with direct access support.
With the two mechanisms implemented in fsdax, we are able to make reflink
and fsdax work together in XFS.
This patch (of 14):
To easily track the filesystem from a pmem device, we introduce a holder for
the dax_device structure, and also its operations. This holder is used to
remember who is using this dax_device:
- When it is the backend of a filesystem, the holder will be the
instance of this filesystem.
- When this pmem device is one of the targets in a mapped device, the
holder will be this mapped device. In this case, the mapped device
has its own dax_device and it will follow the first rule, so that we
can finally track down the filesystem we need.
The holder and holder_ops will be set when a filesystem is being mounted,
or a target device is being activated.
Link: https://lkml.kernel.org/r/20220603053738.1218681-1-ruansy.fnst@fujitsu.com
Link: https://lkml.kernel.org/r/20220603053738.1218681-2-ruansy.fnst@fujitsu.com
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.wiliams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-03 08:37:25 +03:00
fs_put_dax ( sbi - > s_daxdev , NULL ) ;
2021-10-27 17:18:53 +03:00
kfree ( sbi ) ;
}
static struct ext4_sb_info * ext4_alloc_sbi ( struct super_block * sb )
{
struct ext4_sb_info * sbi ;
sbi = kzalloc ( sizeof ( * sbi ) , GFP_KERNEL ) ;
if ( ! sbi )
return NULL ;
2022-06-03 08:37:25 +03:00
sbi - > s_daxdev = fs_dax_get_by_bdev ( sb - > s_bdev , & sbi - > s_dax_part_off ,
NULL , NULL ) ;
2021-10-27 17:18:53 +03:00
sbi - > s_blockgroup_lock =
kzalloc ( sizeof ( struct blockgroup_lock ) , GFP_KERNEL ) ;
if ( ! sbi - > s_blockgroup_lock )
goto err_out ;
sb - > s_fs_info = sbi ;
sbi - > s_sb = sb ;
return sbi ;
err_out :
2022-06-03 08:37:25 +03:00
fs_put_dax ( sbi - > s_daxdev , NULL ) ;
2021-10-27 17:18:53 +03:00
kfree ( sbi ) ;
return NULL ;
}
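ext4_alloc_sbi() and ext4_free_sbi() follow the usual two-step allocation pattern: allocate the outer structure, allocate a member, and unwind through a single error label if the second step fails. The userspace sketch below shows the same shape with calloc() and free() in place of kzalloc() and kfree(). The types `struct fs_info` and `struct lock_table` and both function names are hypothetical, and the DAX device handling is left out.

```c
#include <stdlib.h>

struct lock_table {
	int placeholder;
};

struct fs_info {
	struct lock_table *locks;
	void *owner; /* back-pointer, analogous to sbi->s_sb */
};

/* Safe to call with NULL, like the kernel helper it imitates. */
void fs_info_free(struct fs_info *info)
{
	if (!info)
		return;
	free(info->locks);
	free(info);
}

struct fs_info *fs_info_alloc(void *owner)
{
	struct fs_info *info = calloc(1, sizeof(*info));

	if (!info)
		return NULL;

	info->locks = calloc(1, sizeof(*info->locks));
	if (!info->locks)
		goto err_out;

	/* Publish the back-pointer only once every allocation has succeeded. */
	info->owner = owner;
	return info;

err_out:
	free(info);
	return NULL;
}
```

Zero-initialising the outer structure first means the free routine can release members unconditionally, because any member that was never allocated is still NULL.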
2021-12-22 13:45:17 +03:00
static int __ext4_fill_super ( struct fs_context * fc , struct super_block * sb )
2006-10-11 12:20:50 +04:00
{
2020-02-16 00:40:37 +03:00
struct buffer_head * bh , * * group_desc ;
2006-10-11 12:20:53 +04:00
struct ext4_super_block * es = NULL ;
2021-10-27 17:18:53 +03:00
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
2020-02-19 06:08:51 +03:00
struct flex_groups * * flex_groups ;
2006-10-11 12:20:53 +04:00
ext4_fsblk_t block ;
2006-10-11 12:21:20 +04:00
ext4_fsblk_t logical_sb_block ;
2006-10-11 12:20:50 +04:00
unsigned long offset = 0 ;
unsigned long def_mount_opts ;
struct inode * root ;
2010-07-27 19:56:07 +04:00
int ret = - ENOMEM ;
2011-09-10 02:34:51 +04:00
int blocksize , clustersize ;
2009-01-06 22:53:26 +03:00
unsigned int db_count ;
unsigned int i ;
2020-04-15 10:25:42 +03:00
int needs_recovery , has_huge_files ;
2006-10-11 12:21:10 +04:00
__u64 blocks_count ;
2012-11-09 00:16:54 +04:00
int err = 0 ;
2010-10-28 05:30:05 +04:00
ext4_group_t first_not_zeroed ;
2021-10-27 17:18:53 +03:00
struct ext4_fs_context * ctx = fc - > fs_private ;
2021-12-22 13:45:17 +03:00
int silent = fc - > sb_flags & SB_SILENT ;
2021-04-01 20:21:24 +03:00
/* Set defaults for the variables that will be set during parsing */
2022-04-18 11:35:45 +03:00
if ( ! ( ctx - > spec & EXT4_SPEC_JOURNAL_IOPRIO ) )
ctx - > journal_ioprio = DEFAULT_JOURNAL_IOPRIO ;
2009-02-16 02:07:52 +03:00
2008-10-10 07:53:47 +04:00
sbi - > s_inode_readahead_blks = EXT4_DEF_INODE_READAHEAD_BLKS ;
2020-11-24 11:36:54 +03:00
sbi - > s_sectors_written_start =
part_stat_read ( sb - > s_bdev , sectors [ STAT_WRITE ] ) ;
2006-10-11 12:20:50 +04:00
2012-11-09 00:16:54 +04:00
/* -EINVAL is default */
2010-07-27 19:56:07 +04:00
ret = - EINVAL ;
2006-10-11 12:20:53 +04:00
blocksize = sb_min_blocksize ( sb , EXT4_MIN_BLOCK_SIZE ) ;
2006-10-11 12:20:50 +04:00
if ( ! blocksize ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " unable to set blocksize " ) ;
2006-10-11 12:20:50 +04:00
goto out_fail ;
}
/*
2006-10-11 12:20:53 +04:00
* The ext4 superblock will not be buffer aligned for other than 1 kB
2006-10-11 12:20:50 +04:00
* block sizes . We need to calculate the offset from buffer start .
*/
2006-10-11 12:20:53 +04:00
if ( blocksize ! = EXT4_MIN_BLOCK_SIZE ) {
2021-10-27 17:18:53 +03:00
logical_sb_block = sbi - > s_sb_block * EXT4_MIN_BLOCK_SIZE ;
2006-10-11 12:21:20 +04:00
offset = do_div ( logical_sb_block , blocksize ) ;
2006-10-11 12:20:50 +04:00
} else {
2021-10-27 17:18:53 +03:00
logical_sb_block = sbi - > s_sb_block ;
2006-10-11 12:20:50 +04:00
}
2020-09-24 10:33:37 +03:00
bh = ext4_sb_bread_unmovable ( sb , logical_sb_block ) ;
if ( IS_ERR ( bh ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " unable to read superblock " ) ;
2020-09-24 10:33:37 +03:00
ret = PTR_ERR ( bh ) ;
2006-10-11 12:20:50 +04:00
goto out_fail ;
}
/*
* Note : s_es must be initialized as soon as possible because
2006-10-11 12:20:53 +04:00
* some ext4 macro - instructions depend on its value
2006-10-11 12:20:50 +04:00
*/
2012-05-29 01:47:52 +04:00
es = ( struct ext4_super_block * ) ( bh - > b_data + offset ) ;
2006-10-11 12:20:50 +04:00
sbi - > s_es = es ;
sb - > s_magic = le16_to_cpu ( es - > s_magic ) ;
2006-10-11 12:20:53 +04:00
if ( sb - > s_magic ! = EXT4_SUPER_MAGIC )
goto cantfind_ext4 ;
2009-03-01 03:39:58 +03:00
sbi - > s_kbytes_written = le64_to_cpu ( es - > s_kbytes_written ) ;
2006-10-11 12:20:50 +04:00
2012-04-30 02:45:10 +04:00
/* Warn if metadata_csum and gdt_csum are both set. */
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_metadata_csum ( sb ) & &
ext4_has_feature_gdt_csum ( sb ) )
2015-01-02 23:31:14 +03:00
ext4_warning ( sb , " metadata_csum and uninit_bg are "
2012-04-30 02:45:10 +04:00
" redundant flags; please run fsck. " ) ;
2012-04-30 02:25:10 +04:00
/* Check for a known checksum algorithm */
if ( ! ext4_verify_csum_type ( sb , es ) ) {
ext4_msg ( sb , KERN_ERR , " VFS: Found ext4 filesystem with "
" unknown checksum algorithm. " ) ;
silent = 1 ;
goto cantfind_ext4 ;
}
2021-08-16 12:57:06 +03:00
ext4_setup_csum_trigger ( sb , EXT4_JTR_ORPHAN_FILE ,
ext4_orphan_file_block_trigger ) ;
2012-04-30 02:25:10 +04:00
2012-04-30 02:27:10 +04:00
/* Load the checksum driver */
2018-03-30 05:10:31 +03:00
sbi - > s_chksum_driver = crypto_alloc_shash ( " crc32c " , 0 , 0 ) ;
if ( IS_ERR ( sbi - > s_chksum_driver ) ) {
ext4_msg ( sb , KERN_ERR , " Cannot load crc32c driver. " ) ;
ret = PTR_ERR ( sbi - > s_chksum_driver ) ;
sbi - > s_chksum_driver = NULL ;
goto failed_mount ;
2012-04-30 02:27:10 +04:00
}
2012-04-30 02:29:10 +04:00
/* Check superblock checksum */
if ( ! ext4_superblock_csum_verify ( sb , es ) ) {
ext4_msg ( sb , KERN_ERR , " VFS: Found ext4 filesystem with "
" invalid superblock checksum. Run e2fsck? " ) ;
silent = 1 ;
2015-10-17 23:16:04 +03:00
ret = - EFSBADCRC ;
2012-04-30 02:29:10 +04:00
goto cantfind_ext4 ;
}
/* Precompute checksum seed for all metadata */
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_csum_seed ( sb ) )
2015-10-17 23:16:02 +03:00
sbi - > s_csum_seed = le32_to_cpu ( es - > s_checksum_seed ) ;
2017-06-22 18:44:55 +03:00
else if ( ext4_has_metadata_csum ( sb ) | | ext4_has_feature_ea_inode ( sb ) )
2012-04-30 02:29:10 +04:00
sbi - > s_csum_seed = ext4_chksum ( sbi , ~ 0 , es - > s_uuid ,
sizeof ( es - > s_uuid ) ) ;
2006-10-11 12:20:50 +04:00
/* Set defaults before we parse the mount options */
def_mount_opts = le32_to_cpu ( es - > s_default_mount_opts ) ;
2010-12-16 04:26:48 +03:00
set_opt ( sb , INIT_INODE_TABLE ) ;
2006-10-11 12:20:53 +04:00
if ( def_mount_opts & EXT4_DEFM_DEBUG )
2010-12-16 04:26:48 +03:00
set_opt ( sb , DEBUG ) ;
2012-03-02 09:03:21 +04:00
if ( def_mount_opts & EXT4_DEFM_BSDGROUPS )
2010-12-16 04:26:48 +03:00
set_opt ( sb , GRPID ) ;
2006-10-11 12:20:53 +04:00
if ( def_mount_opts & EXT4_DEFM_UID16 )
2010-12-16 04:26:48 +03:00
set_opt ( sb , NO_UID32 ) ;
2011-02-24 01:51:51 +03:00
/* xattr user namespace & acls are now defaulted on */
set_opt ( sb , XATTR_USER ) ;
2008-10-11 04:02:48 +04:00
# ifdef CONFIG_EXT4_FS_POSIX_ACL
2011-02-24 01:51:51 +03:00
set_opt ( sb , POSIX_ACL ) ;
2007-02-10 12:46:13 +03:00
# endif
2020-10-15 23:37:54 +03:00
if ( ext4_has_feature_fast_commit ( sb ) )
set_opt2 ( sb , JOURNAL_FAST_COMMIT ) ;
2014-10-30 17:53:16 +03:00
/* don't forget to enable journal_csum when metadata_csum is enabled. */
if ( ext4_has_metadata_csum ( sb ) )
set_opt ( sb , JOURNAL_CHECKSUM ) ;
2006-10-11 12:20:53 +04:00
if ( ( def_mount_opts & EXT4_DEFM_JMODE ) = = EXT4_DEFM_JMODE_DATA )
2010-12-16 04:26:48 +03:00
set_opt ( sb , JOURNAL_DATA ) ;
2006-10-11 12:20:53 +04:00
else if ( ( def_mount_opts & EXT4_DEFM_JMODE ) = = EXT4_DEFM_JMODE_ORDERED )
2010-12-16 04:26:48 +03:00
set_opt ( sb , ORDERED_DATA ) ;
2006-10-11 12:20:53 +04:00
else if ( ( def_mount_opts & EXT4_DEFM_JMODE ) = = EXT4_DEFM_JMODE_WBACK )
2010-12-16 04:26:48 +03:00
set_opt ( sb , WRITEBACK_DATA ) ;
2006-10-11 12:20:53 +04:00
if ( le16_to_cpu ( sbi - > s_es - > s_errors ) = = EXT4_ERRORS_PANIC )
2010-12-16 04:26:48 +03:00
set_opt ( sb , ERRORS_PANIC ) ;
2008-01-29 07:58:26 +03:00
else if ( le16_to_cpu ( sbi - > s_es - > s_errors ) = = EXT4_ERRORS_CONTINUE )
2010-12-16 04:26:48 +03:00
set_opt ( sb , ERRORS_CONT ) ;
2008-01-29 07:58:26 +03:00
else
2010-12-16 04:26:48 +03:00
set_opt ( sb , ERRORS_RO ) ;
2014-09-02 05:34:09 +04:00
/* block_validity enabled by default; disable with noblock_validity */
set_opt ( sb , BLOCK_VALIDITY ) ;
2010-08-02 07:14:20 +04:00
if ( def_mount_opts & EXT4_DEFM_DISCARD )
2010-12-16 04:26:48 +03:00
set_opt ( sb , DISCARD ) ;
2006-10-11 12:20:50 +04:00
2012-02-08 03:41:49 +04:00
sbi - > s_resuid = make_kuid ( & init_user_ns , le16_to_cpu ( es - > s_def_resuid ) ) ;
sbi - > s_resgid = make_kgid ( & init_user_ns , le16_to_cpu ( es - > s_def_resgid ) ) ;
2009-01-04 04:27:38 +03:00
sbi - > s_commit_interval = JBD2_DEFAULT_MAX_COMMIT_AGE * HZ ;
sbi - > s_min_batch_time = EXT4_DEF_MIN_BATCH_TIME ;
sbi - > s_max_batch_time = EXT4_DEF_MAX_BATCH_TIME ;
2006-10-11 12:20:50 +04:00
2010-08-02 07:14:20 +04:00
if ( ( def_mount_opts & EXT4_DEFM_NOBARRIER ) = = 0 )
2010-12-16 04:26:48 +03:00
set_opt ( sb , BARRIER ) ;
2006-10-11 12:20:50 +04:00
2008-07-12 03:27:31 +04:00
/*
* enable delayed allocation by default
* Use - o nodelalloc to turn it off
*/
2012-09-18 06:54:36 +04:00
if ( ! IS_EXT3_SB ( sb ) & & ! IS_EXT2_SB ( sb ) & &
2010-08-02 07:14:20 +04:00
( ( def_mount_opts & EXT4_DEFM_NODELALLOC ) = = 0 ) )
2010-12-16 04:26:48 +03:00
set_opt ( sb , DELALLOC ) ;
2008-07-12 03:27:31 +04:00
2011-05-20 21:55:16 +04:00
/*
* set default s_li_wait_mult for lazyinit , for the case there is
* no mount option specified .
*/
sbi - > s_li_wait_mult = EXT4_DEF_LI_WAIT_MULT ;
2020-12-09 23:59:11 +03:00
if ( le32_to_cpu ( es - > s_log_block_size ) >
( EXT4_MAX_BLOCK_LOG_SIZE - EXT4_MIN_BLOCK_LOG_SIZE ) ) {
2020-02-07 01:35:01 +03:00
ext4_msg ( sb , KERN_ERR ,
2020-12-09 23:59:11 +03:00
" Invalid log block size: %u " ,
le32_to_cpu ( es - > s_log_block_size ) ) ;
2020-02-07 01:35:01 +03:00
goto failed_mount ;
}
2020-12-09 23:59:11 +03:00
if ( le32_to_cpu ( es - > s_log_cluster_size ) >
( EXT4_MAX_CLUSTER_LOG_SIZE - EXT4_MIN_BLOCK_LOG_SIZE ) ) {
2020-02-07 01:35:01 +03:00
ext4_msg ( sb , KERN_ERR ,
2020-12-09 23:59:11 +03:00
" Invalid log cluster size: %u " ,
le32_to_cpu ( es - > s_log_cluster_size ) ) ;
2020-02-07 01:35:01 +03:00
goto failed_mount ;
}
2020-12-09 23:59:11 +03:00
blocksize = EXT4_MIN_BLOCK_SIZE < < le32_to_cpu ( es - > s_log_block_size ) ;
if ( blocksize = = PAGE_SIZE )
set_opt ( sb , DIOREAD_NOLOCK ) ;
2020-02-07 01:35:01 +03:00
2019-12-15 09:09:03 +03:00
if ( le32_to_cpu ( es - > s_rev_level ) = = EXT4_GOOD_OLD_REV ) {
sbi - > s_inode_size = EXT4_GOOD_OLD_INODE_SIZE ;
sbi - > s_first_ino = EXT4_GOOD_OLD_FIRST_INO ;
} else {
sbi - > s_inode_size = le16_to_cpu ( es - > s_inode_size ) ;
sbi - > s_first_ino = le32_to_cpu ( es - > s_first_ino ) ;
if ( sbi - > s_first_ino < EXT4_GOOD_OLD_FIRST_INO ) {
ext4_msg ( sb , KERN_ERR , " invalid first ino: %u " ,
sbi - > s_first_ino ) ;
goto failed_mount ;
}
if ( ( sbi - > s_inode_size < EXT4_GOOD_OLD_INODE_SIZE ) | |
( ! is_power_of_2 ( sbi - > s_inode_size ) ) | |
( sbi - > s_inode_size > blocksize ) ) {
ext4_msg ( sb , KERN_ERR ,
" unsupported inode size: %d " ,
sbi - > s_inode_size ) ;
2020-02-07 01:35:01 +03:00
ext4_msg ( sb , KERN_ERR , " blocksize: %d " , blocksize ) ;
2019-12-15 09:09:03 +03:00
goto failed_mount ;
}
/*
* i_atime_extra is the last extra field available for
* [ acm ] times in struct ext4_inode . Checking for that
* field should suffice to ensure we have extra space
* for all three .
*/
if ( sbi - > s_inode_size > = offsetof ( struct ext4_inode , i_atime_extra ) +
sizeof ( ( ( struct ext4_inode * ) 0 ) - > i_atime_extra ) ) {
sb - > s_time_gran = 1 ;
sb - > s_time_max = EXT4_EXTRA_TIMESTAMP_MAX ;
} else {
sb - > s_time_gran = NSEC_PER_SEC ;
sb - > s_time_max = EXT4_NON_EXTRA_TIMESTAMP_MAX ;
}
sb - > s_time_min = EXT4_TIMESTAMP_MIN ;
}
if ( sbi - > s_inode_size > EXT4_GOOD_OLD_INODE_SIZE ) {
sbi - > s_want_extra_isize = sizeof ( struct ext4_inode ) -
EXT4_GOOD_OLD_INODE_SIZE ;
if ( ext4_has_feature_extra_isize ( sb ) ) {
unsigned v , max = ( sbi - > s_inode_size -
EXT4_GOOD_OLD_INODE_SIZE ) ;
v = le16_to_cpu ( es - > s_want_extra_isize ) ;
if ( v > max ) {
ext4_msg ( sb , KERN_ERR ,
" bad s_want_extra_isize: %d " , v ) ;
goto failed_mount ;
}
if ( sbi - > s_want_extra_isize < v )
sbi - > s_want_extra_isize = v ;
v = le16_to_cpu ( es - > s_min_extra_isize ) ;
if ( v > max ) {
ext4_msg ( sb , KERN_ERR ,
" bad s_min_extra_isize: %d " , v ) ;
goto failed_mount ;
}
if ( sbi - > s_want_extra_isize < v )
sbi - > s_want_extra_isize = v ;
}
}
2021-10-27 17:18:53 +03:00
err = parse_apply_sb_mount_options ( sb , ctx ) ;
if ( err < 0 )
goto failed_mount ;
2012-03-05 04:27:31 +04:00
sbi - > s_def_mount_opt = sbi - > s_mount_opt ;
2021-10-27 17:18:53 +03:00
err = ext4_check_opt_consistency ( fc , sb ) ;
if ( err < 0 )
goto failed_mount ;
2022-05-26 07:04:12 +03:00
ext4_apply_options ( fc , sb ) ;
2006-10-11 12:20:50 +04:00
2022-01-18 09:56:14 +03:00
# if IS_ENABLED(CONFIG_UNICODE)
2020-10-28 08:08:20 +03:00
if ( ext4_has_feature_casefold ( sb ) & & ! sb - > s_encoding ) {
2019-04-25 21:05:42 +03:00
const struct ext4_sb_encodings * encoding_info ;
struct unicode_map * encoding ;
2021-09-15 09:59:56 +03:00
__u16 encoding_flags = le16_to_cpu ( es - > s_encoding_flags ) ;
2019-04-25 21:05:42 +03:00
2021-09-15 09:59:56 +03:00
encoding_info = ext4_sb_read_encoding ( es ) ;
if ( ! encoding_info ) {
2019-04-25 21:05:42 +03:00
ext4_msg ( sb , KERN_ERR ,
" Encoding requested by superblock is unknown " ) ;
goto failed_mount ;
}
encoding = utf8_load ( encoding_info - > version ) ;
if ( IS_ERR ( encoding ) ) {
ext4_msg ( sb , KERN_ERR ,
2021-09-15 10:00:00 +03:00
" can't mount with superblock charset: %s-%u.%u.%u "
2019-04-25 21:05:42 +03:00
" not supported by the kernel. flags: 0x%x. " ,
2021-09-15 10:00:00 +03:00
encoding_info - > name ,
unicode_major ( encoding_info - > version ) ,
unicode_minor ( encoding_info - > version ) ,
unicode_rev ( encoding_info - > version ) ,
2019-04-25 21:05:42 +03:00
encoding_flags ) ;
goto failed_mount ;
}
ext4_msg ( sb , KERN_INFO , " Using encoding defined by superblock: "
2021-09-15 10:00:00 +03:00
" %s-%u.%u.%u with flags 0x%hx " , encoding_info - > name ,
unicode_major ( encoding_info - > version ) ,
unicode_minor ( encoding_info - > version ) ,
unicode_rev ( encoding_info - > version ) ,
encoding_flags ) ;
2019-04-25 21:05:42 +03:00
2020-10-28 08:08:20 +03:00
sb - > s_encoding = encoding ;
sb - > s_encoding_flags = encoding_flags ;
2019-04-25 21:05:42 +03:00
}
# endif
2011-09-04 02:22:38 +04:00
if ( test_opt ( sb , DATA_FLAGS ) = = EXT4_MOUNT_JOURNAL_DATA ) {
2020-11-06 06:59:07 +03:00
printk_once ( KERN_WARNING " EXT4-fs: Warning: mounting with data=journal disables delayed allocation, dioread_nolock, O_DIRECT and fast_commit support! \n " ) ;
2020-04-13 07:24:22 +03:00
/* can't mount with both data=journal and dioread_nolock. */
2020-01-23 20:23:17 +03:00
clear_opt ( sb , DIOREAD_NOLOCK ) ;
2020-11-06 06:59:07 +03:00
clear_opt2 ( sb , JOURNAL_FAST_COMMIT ) ;
2011-09-04 02:22:38 +04:00
if ( test_opt2 ( sb , EXPLICIT_DELALLOC ) ) {
ext4_msg ( sb , KERN_ERR , " can't mount with "
" both data=journal and delalloc " ) ;
goto failed_mount ;
}
2020-05-28 17:59:57 +03:00
if ( test_opt ( sb , DAX_ALWAYS ) ) {
2015-02-17 02:59:38 +03:00
ext4_msg ( sb , KERN_ERR , " can't mount with "
" both data=journal and dax " ) ;
goto failed_mount ;
}
ext4: do not perform data journaling when data is encrypted
Currently data journalling is incompatible with encryption: enabling both
at the same time has never been supported by design, and would result in
unpredictable behavior. However, users are not precluded from turning on
both features simultaneously. This change programmatically replaces data
journaling for encrypted regular files with ordered data journaling mode.
Background:
Journaling encrypted data has not been supported because it operates on
buffer heads of the page in the page cache. Namely, when the commit
happens, which could be up to five seconds after caching, the commit
thread uses the buffer heads attached to the page to copy the contents of
the page to the journal. With encryption, it would have been required to
keep the bounce buffer with ciphertext for up to the aforementioned five
seconds, since the page cache can only hold plaintext and could not be
used for journaling. Alternatively, it would be required to set up the
journal to initiate a callback at the commit time to perform deferred
encryption - in this case, not only would the data have to be written
twice, but it would also have to be encrypted twice. This level of
complexity was not justified for a mode that in practice is very rarely
used because of the overhead from the data journalling.
Solution:
If data=journal has been set as a mount option for a filesystem, or if
journaling is enabled on a regular file, do not perform journaling if the
file is also encrypted, instead fall back to the data=ordered mode for the
file.
Rationale:
The intent is to allow seamless and proper filesystem operation when
journaling and encryption have both been enabled, and have these two
conflicting features gracefully resolved by the filesystem.
Fixes: 4461471107b7
Signed-off-by: Sergey Karamov <skaramov@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
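The fallback described in the Solution paragraph above can be sketched as a small decision function. This is only an illustration under stated assumptions: `enum data_mode` and `effective_mode()` are hypothetical names, not kernel identifiers, and the sketch ignores other conditions the kernel may consider (for example delayed allocation).

```c
#include <stdbool.h>

/* Hypothetical journaling modes, for illustration only. */
enum data_mode { DATA_JOURNAL, DATA_ORDERED, DATA_WRITEBACK };

/*
 * Hypothetical helper: pick the effective mode for one regular file.
 * mount_mode        - mode selected by the data= mount option
 * file_journal_flag - per-file request for data journaling
 * encrypted         - whether the file is encrypted
 */
enum data_mode effective_mode(enum data_mode mount_mode,
			      bool file_journal_flag, bool encrypted)
{
	bool wants_journal = (mount_mode == DATA_JOURNAL) || file_journal_flag;

	/* Encrypted data is never journaled; fall back to ordered mode. */
	if (wants_journal && encrypted)
		return DATA_ORDERED;
	if (wants_journal)
		return DATA_JOURNAL;
	return mount_mode;
}
```

Under these assumptions, an encrypted file on a data=journal mount resolves to ordered mode, while an unencrypted file on the same mount keeps data journaling.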
2016-12-11 01:54:58 +03:00
if ( ext4_has_feature_encrypt ( sb ) ) {
ext4_msg ( sb , KERN_WARNING ,
" encrypted files will use data=ordered "
" instead of data journaling mode " ) ;
}
2011-09-04 02:22:38 +04:00
if ( test_opt ( sb , DELALLOC ) )
clear_opt ( sb , DELALLOC ) ;
2015-07-22 06:51:26 +03:00
} else {
sb - > s_iflags | = SB_I_CGROUPWB ;
2011-09-04 02:22:38 +04:00
}
2017-11-28 00:05:09 +03:00
sb - > s_flags = ( sb - > s_flags & ~ SB_POSIXACL ) |
( test_opt ( sb , POSIX_ACL ) ? SB_POSIXACL : 0 ) ;
2006-10-11 12:20:50 +04:00
2006-10-11 12:20:53 +04:00
if ( le32_to_cpu ( es - > s_rev_level ) = = EXT4_GOOD_OLD_REV & &
2015-10-17 23:18:43 +03:00
( ext4_has_compat_features ( sb ) | |
ext4_has_ro_compat_features ( sb ) | |
ext4_has_incompat_features ( sb ) ) )
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_WARNING ,
" feature flags set on rev 0 fs, "
" running e2fsck is recommended " ) ;
2008-02-10 09:11:44 +03:00
2014-03-24 22:09:06 +04:00
if ( es - > s_creator_os = = cpu_to_le32 ( EXT4_OS_HURD ) ) {
set_opt2 ( sb , HURD_COMPAT ) ;
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_64bit ( sb ) ) {
2014-03-24 22:09:06 +04:00
ext4_msg ( sb , KERN_ERR ,
" The Hurd can't support 64-bit file systems " ) ;
goto failed_mount ;
}
2017-06-22 18:44:55 +03:00
/*
* ea_inode feature uses l_i_version field which is not
* available in HURD_COMPAT mode .
*/
if ( ext4_has_feature_ea_inode ( sb ) ) {
ext4_msg ( sb , KERN_ERR ,
" ea_inode feature is not supported for Hurd " ) ;
goto failed_mount ;
}
2014-03-24 22:09:06 +04:00
}
2011-04-19 01:29:14 +04:00
if ( IS_EXT2_SB ( sb ) ) {
if ( ext2_feature_set_ok ( sb ) )
ext4_msg ( sb , KERN_INFO , " mounting ext2 file system "
" using the ext4 subsystem " ) ;
else {
2018-03-22 18:59:00 +03:00
/*
* If we ' re probing be silent , if this looks like
* it ' s actually an ext [ 34 ] filesystem .
*/
if ( silent & & ext4_feature_set_ok ( sb , sb_rdonly ( sb ) ) )
goto failed_mount ;
2011-04-19 01:29:14 +04:00
ext4_msg ( sb , KERN_ERR , " couldn't mount as ext2 due "
" to feature incompatibilities " ) ;
goto failed_mount ;
}
}
if ( IS_EXT3_SB ( sb ) ) {
if ( ext3_feature_set_ok ( sb ) )
ext4_msg ( sb , KERN_INFO , " mounting ext3 file system "
" using the ext4 subsystem " ) ;
else {
2018-03-22 18:59:00 +03:00
/*
* If we ' re probing be silent , if this looks like
* it ' s actually an ext4 filesystem .
*/
if ( silent & & ext4_feature_set_ok ( sb , sb_rdonly ( sb ) ) )
goto failed_mount ;
2011-04-19 01:29:14 +04:00
ext4_msg ( sb , KERN_ERR , " couldn't mount as ext3 due "
" to feature incompatibilities " ) ;
goto failed_mount ;
}
}
2006-10-11 12:20:50 +04:00
/*
* Check feature flags regardless of the revision level , since we
* previously didn ' t change the revision level when setting the flags ,
* so there is a chance incompat flags are set on a rev 0 filesystem .
*/
2017-07-17 10:45:34 +03:00
if ( ! ext4_feature_set_ok ( sb , ( sb_rdonly ( sb ) ) ) )
2006-10-11 12:20:50 +04:00
goto failed_mount ;
2009-08-18 08:20:23 +04:00
2016-07-06 03:01:52 +03:00
if ( le16_to_cpu ( sbi - > s_es - > s_reserved_gdt_blocks ) > ( blocksize / 4 ) ) {
ext4_msg ( sb , KERN_ERR ,
" Number of reserved GDT blocks insanely large: %d " ,
le16_to_cpu ( sbi - > s_es - > s_reserved_gdt_blocks ) ) ;
goto failed_mount ;
}
2021-11-29 13:21:54 +03:00
if ( sbi - > s_daxdev ) {
2021-11-29 13:21:42 +03:00
if ( blocksize = = PAGE_SIZE )
set_bit ( EXT4_FLAGS_BDEV_IS_DAX , & sbi - > s_ext4_flags ) ;
else
ext4_msg ( sb , KERN_ERR , " unsupported blocksize for DAX \n " ) ;
}
2020-05-28 17:59:58 +03:00
2020-05-28 17:59:57 +03:00
if ( sbi - > s_mount_opt & EXT4_MOUNT_DAX_ALWAYS ) {
2017-10-12 18:52:34 +03:00
if ( ext4_has_feature_inline_data ( sb ) ) {
ext4_msg ( sb , KERN_ERR , " Cannot use DAX on a filesystem "
" that may contain inline data " ) ;
2018-12-04 08:46:39 +03:00
goto failed_mount ;
2017-10-12 18:52:34 +03:00
}
2020-05-28 17:59:58 +03:00
if ( ! test_bit ( EXT4_FLAGS_BDEV_IS_DAX , & sbi - > s_ext4_flags ) ) {
2017-12-22 04:04:07 +03:00
ext4_msg ( sb , KERN_ERR ,
2018-12-04 08:46:39 +03:00
" DAX unsupported by block device. " ) ;
goto failed_mount ;
2017-12-22 04:04:07 +03:00
}
2015-02-17 02:59:38 +03:00
}
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_encrypt ( sb ) & & es - > s_encryption_level ) {
2015-04-16 08:56:00 +03:00
ext4_msg ( sb , KERN_ERR , " Unsupported encryption level %d " ,
es - > s_encryption_level ) ;
goto failed_mount ;
}
2006-10-11 12:20:50 +04:00
if ( sb - > s_blocksize ! = blocksize ) {
2021-05-21 10:55:33 +03:00
/*
* bh must be released before kill_bdev ( ) , otherwise
* it won ' t be freed and its page also . kill_bdev ( )
* is called by sb_set_blocksize ( ) .
*/
brelse ( bh ) ;
2008-01-29 07:58:27 +03:00
/* Validate the filesystem blocksize */
if ( ! sb_set_blocksize ( sb , blocksize ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " bad block size %d " ,
2008-01-29 07:58:27 +03:00
blocksize ) ;
2021-05-21 10:55:33 +03:00
bh = NULL ;
2006-10-11 12:20:50 +04:00
goto failed_mount ;
}
2021-10-27 17:18:53 +03:00
logical_sb_block = sbi - > s_sb_block * EXT4_MIN_BLOCK_SIZE ;
2006-10-11 12:21:20 +04:00
offset = do_div ( logical_sb_block , blocksize ) ;
2020-09-24 10:33:37 +03:00
bh = ext4_sb_bread_unmovable ( sb , logical_sb_block ) ;
if ( IS_ERR ( bh ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR ,
" Can't read superblock on 2nd try " ) ;
2020-09-24 10:33:37 +03:00
ret = PTR_ERR ( bh ) ;
bh = NULL ;
2006-10-11 12:20:50 +04:00
goto failed_mount ;
}
2012-05-29 01:47:52 +04:00
es = ( struct ext4_super_block * ) ( bh - > b_data + offset ) ;
2006-10-11 12:20:50 +04:00
sbi - > s_es = es ;
2006-10-11 12:20:53 +04:00
if ( es - > s_magic ! = cpu_to_le16 ( EXT4_SUPER_MAGIC ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR ,
" Magic mismatch, very weird! " ) ;
2006-10-11 12:20:50 +04:00
goto failed_mount ;
}
}
2015-10-17 23:18:43 +03:00
has_huge_files = ext4_has_feature_huge_file ( sb ) ;
2008-10-17 06:50:48 +04:00
sbi - > s_bitmap_maxbytes = ext4_max_bitmap_size ( sb - > s_blocksize_bits ,
has_huge_files ) ;
sb - > s_maxbytes = ext4_max_size ( sb - > s_blocksize_bits , has_huge_files ) ;
2006-10-11 12:20:50 +04:00
2006-10-11 12:21:14 +04:00
sbi - > s_desc_size = le16_to_cpu ( es - > s_desc_size ) ;
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_64bit ( sb ) ) {
2006-10-11 12:21:15 +04:00
if ( sbi - > s_desc_size < EXT4_MIN_DESC_SIZE_64BIT | |
2006-10-11 12:21:14 +04:00
sbi - > s_desc_size > EXT4_MAX_DESC_SIZE | |
2007-10-17 10:27:14 +04:00
! is_power_of_2 ( sbi - > s_desc_size ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR ,
" unsupported descriptor size %lu " ,
2006-10-11 12:21:14 +04:00
sbi - > s_desc_size ) ;
goto failed_mount ;
}
} else
sbi - > s_desc_size = EXT4_MIN_DESC_SIZE ;
2009-06-04 01:59:28 +04:00
2006-10-11 12:20:50 +04:00
sbi - > s_blocks_per_group = le32_to_cpu ( es - > s_blocks_per_group ) ;
sbi - > s_inodes_per_group = le32_to_cpu ( es - > s_inodes_per_group ) ;
2009-06-04 01:59:28 +04:00
2006-10-11 12:20:53 +04:00
sbi - > s_inodes_per_block = blocksize / EXT4_INODE_SIZE ( sb ) ;
2006-10-11 12:20:50 +04:00
if ( sbi - > s_inodes_per_block = = 0 )
2006-10-11 12:20:53 +04:00
goto cantfind_ext4 ;
2016-11-18 21:28:30 +03:00
if ( sbi - > s_inodes_per_group < sbi - > s_inodes_per_block | |
sbi - > s_inodes_per_group > blocksize * 8 ) {
ext4_msg ( sb , KERN_ERR , " invalid inodes per group: %lu \n " ,
2020-03-29 01:34:15 +03:00
sbi - > s_inodes_per_group ) ;
2016-11-18 21:28:30 +03:00
goto failed_mount ;
}
2006-10-11 12:20:50 +04:00
sbi - > s_itb_per_group = sbi - > s_inodes_per_group /
sbi - > s_inodes_per_block ;
2006-10-11 12:21:14 +04:00
sbi - > s_desc_per_block = blocksize / EXT4_DESC_SIZE ( sb ) ;
2006-10-11 12:20:50 +04:00
sbi - > s_sbh = bh ;
2022-05-17 20:27:55 +03:00
sbi - > s_mount_state = le16_to_cpu ( es - > s_state ) & ~ EXT4_FC_REPLAY ;
2007-10-17 10:26:25 +04:00
sbi - > s_addr_per_block_bits = ilog2 ( EXT4_ADDR_PER_BLOCK ( sb ) ) ;
sbi - > s_desc_per_block_bits = ilog2 ( EXT4_DESC_PER_BLOCK ( sb ) ) ;
2009-06-04 01:59:28 +04:00
2008-07-27 00:15:44 +04:00
for ( i = 0 ; i < 4 ; i + + )
2006-10-11 12:20:50 +04:00
sbi - > s_hash_seed [ i ] = le32_to_cpu ( es - > s_hash_seed [ i ] ) ;
sbi - > s_def_hash_version = es - > s_def_hash_version ;
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_dir_index ( sb ) ) {
2014-02-12 21:16:04 +04:00
i = le32_to_cpu ( es - > s_flags ) ;
if ( i & EXT2_FLAGS_UNSIGNED_HASH )
sbi - > s_hash_unsigned = 3 ;
else if ( ( i & EXT2_FLAGS_SIGNED_HASH ) = = 0 ) {
2008-10-28 20:21:44 +03:00
# ifdef __CHAR_UNSIGNED__
2017-07-17 10:45:34 +03:00
if ( ! sb_rdonly ( sb ) )
2014-02-12 21:16:04 +04:00
es - > s_flags | =
cpu_to_le32 ( EXT2_FLAGS_UNSIGNED_HASH ) ;
sbi - > s_hash_unsigned = 3 ;
2008-10-28 20:21:44 +03:00
# else
2017-07-17 10:45:34 +03:00
if ( ! sb_rdonly ( sb ) )
2014-02-12 21:16:04 +04:00
es - > s_flags | =
cpu_to_le32 ( EXT2_FLAGS_SIGNED_HASH ) ;
2008-10-28 20:21:44 +03:00
# endif
2014-02-12 21:16:04 +04:00
}
2008-10-28 20:21:44 +03:00
}
2006-10-11 12:20:50 +04:00
2011-09-10 02:34:51 +04:00
/* Handle clustersize */
clustersize = BLOCK_SIZE < < le32_to_cpu ( es - > s_log_cluster_size ) ;
2020-04-15 10:25:42 +03:00
if ( ext4_has_feature_bigalloc ( sb ) ) {
2011-09-10 02:34:51 +04:00
if ( clustersize < blocksize ) {
ext4_msg ( sb , KERN_ERR ,
" cluster size (%d) smaller than "
" block size (%d) " , clustersize , blocksize ) ;
goto failed_mount ;
}
sbi - > s_cluster_bits = le32_to_cpu ( es - > s_log_cluster_size ) -
le32_to_cpu ( es - > s_log_block_size ) ;
sbi - > s_clusters_per_group =
le32_to_cpu ( es - > s_clusters_per_group ) ;
if ( sbi - > s_clusters_per_group > blocksize * 8 ) {
ext4_msg ( sb , KERN_ERR ,
" #clusters per group too big: %lu " ,
sbi - > s_clusters_per_group ) ;
goto failed_mount ;
}
if ( sbi - > s_blocks_per_group ! =
( sbi - > s_clusters_per_group * ( clustersize / blocksize ) ) ) {
ext4_msg ( sb , KERN_ERR , " blocks per group (%lu) and "
" clusters per group (%lu) inconsistent " ,
sbi - > s_blocks_per_group ,
sbi - > s_clusters_per_group ) ;
goto failed_mount ;
}
} else {
if ( clustersize ! = blocksize ) {
2018-06-18 01:11:20 +03:00
ext4_msg ( sb , KERN_ERR ,
" fragment/cluster size (%d) != "
" block size (%d) " , clustersize , blocksize ) ;
goto failed_mount ;
2011-09-10 02:34:51 +04:00
}
if ( sbi - > s_blocks_per_group > blocksize * 8 ) {
ext4_msg ( sb , KERN_ERR ,
" #blocks per group too big: %lu " ,
sbi - > s_blocks_per_group ) ;
goto failed_mount ;
}
sbi - > s_clusters_per_group = sbi - > s_blocks_per_group ;
sbi - > s_cluster_bits = 0 ;
2006-10-11 12:20:50 +04:00
}
2011-09-10 02:34:51 +04:00
sbi - > s_cluster_ratio = clustersize / blocksize ;
2013-07-06 07:11:16 +04:00
/* Do we have standard group size of clustersize * 8 blocks ? */
if ( sbi - > s_blocks_per_group = = clustersize < < 3 )
set_opt2 ( sb , STD_GROUP_SIZE ) ;
2009-08-18 07:48:51 +04:00
/*
* Test whether we have more sectors than will fit in sector_t ,
* and whether the max offset is addressable by the page cache .
*/
2010-11-19 17:56:44 +03:00
err = generic_check_addressable ( sb - > s_blocksize_bits ,
2010-07-23 02:03:41 +04:00
ext4_blocks_count ( es ) ) ;
2010-11-19 17:56:44 +03:00
if ( err ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " filesystem "
2009-08-18 07:48:51 +04:00
" too large to mount safely on this system " ) ;
2006-10-11 12:20:50 +04:00
goto failed_mount ;
}
2006-10-11 12:20:53 +04:00
if ( EXT4_BLOCKS_PER_GROUP ( sb ) = = 0 )
goto cantfind_ext4 ;
ext4: fix oops on corrupted ext4 mount
When mounting an ext4 filesystem with corrupted s_first_data_block, things
can go very wrong and oops.
Because blocks_count in ext4_fill_super is a u64, and we must use do_div,
the calculation of db_count is done differently than on ext4. If
first_data_block is corrupted such that it is larger than ext4_blocks_count,
for example, then the intermediate blocks_count value may go negative,
but sign-extend to a very large value:
blocks_count = (ext4_blocks_count(es) -
le32_to_cpu(es->s_first_data_block) +
EXT4_BLOCKS_PER_GROUP(sb) - 1);
This is then assigned to s_groups_count which is an unsigned long:
sbi->s_groups_count = blocks_count;
This may result in a value of 0xFFFFFFFF which is then used to compute
db_count:
db_count = (sbi->s_groups_count + EXT4_DESC_PER_BLOCK(sb) - 1) /
EXT4_DESC_PER_BLOCK(sb);
and in this case db_count will wind up as 0 because the addition overflows
32 bits. This in turn causes the kmalloc for group_desc to be of 0 size:
sbi->s_group_desc = kmalloc(db_count * sizeof (struct buffer_head *),
GFP_KERNEL);
and eventually in ext4_check_descriptors, dereferencing
sbi->s_group_desc[desc_block] will result in a NULL pointer dereference.
The simplest test seems to be to sanity check s_first_data_block,
EXT4_BLOCKS_PER_GROUP, and ext4_blocks_count values to be sure
their combination won't result in a bad intermediate value for
blocks_count. We could just check for db_count == 0, but
catching it at the root cause seems like it provides more info.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
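The wrap-around described above can be reproduced in user space. The following is a minimal sketch, not the kernel code: `db_count_32()` and `first_data_block_ok()` are hypothetical helpers, and 32 descriptors per block is an assumed example value.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of descriptor blocks, computed in 32-bit
 * arithmetic the way an unsigned long behaves on a 32-bit kernel.
 * The sum wraps modulo 2^32 before the division.
 */
uint32_t db_count_32(uint32_t groups, uint32_t desc_per_block)
{
	return (uint32_t)(groups + desc_per_block - 1u) / desc_per_block;
}

/*
 * Hypothetical helper mirroring the sanity check: the first data block
 * must lie inside the filesystem.
 */
bool first_data_block_ok(uint64_t blocks_count, uint32_t first_data_block)
{
	return first_data_block < blocks_count;
}
```

With a group count of 0xFFFFFFFF and 32 descriptors per block, the sum wraps to 30 and the division yields 0, which is the zero-sized allocation the message describes. Rejecting the bad first data block up front avoids ever reaching that computation.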
2008-01-29 07:58:27 +03:00
2009-04-07 22:07:47 +04:00
/* check blocks count against device size */
2021-10-18 13:11:26 +03:00
blocks_count = sb_bdev_nr_blocks ( sb ) ;
2009-04-07 22:07:47 +04:00
if ( blocks_count & & ext4_blocks_count ( es ) > blocks_count ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_WARNING , " bad geometry: block count %llu "
" exceeds size of device (%llu blocks) " ,
2009-04-07 22:07:47 +04:00
ext4_blocks_count ( es ) , blocks_count ) ;
goto failed_mount ;
}
2009-06-04 01:59:28 +04:00
/*
* It makes no sense for the first data block to be beyond the end
* of the filesystem .
*/
if ( le32_to_cpu ( es - > s_first_data_block ) > = ext4_blocks_count ( es ) ) {
2011-12-19 01:13:58 +04:00
ext4_msg ( sb , KERN_WARNING , " bad geometry: first data "
2009-06-05 01:36:36 +04:00
" block %u is beyond end of filesystem (%llu) " ,
le32_to_cpu ( es - > s_first_data_block ) ,
ext4_blocks_count ( es ) ) ;
2008-01-29 07:58:27 +03:00
goto failed_mount ;
}
2018-06-18 01:11:20 +03:00
if ( ( es - > s_first_data_block = = 0 ) & & ( es - > s_log_block_size = = 0 ) & &
( sbi - > s_cluster_ratio = = 1 ) ) {
ext4_msg ( sb , KERN_WARNING , " bad geometry: first data "
" block is 0 with a 1k block and cluster size " ) ;
goto failed_mount ;
}
2006-10-11 12:21:10 +04:00
blocks_count = ( ext4_blocks_count ( es ) -
le32_to_cpu ( es - > s_first_data_block ) +
EXT4_BLOCKS_PER_GROUP ( sb ) - 1 ) ;
do_div ( blocks_count , EXT4_BLOCKS_PER_GROUP ( sb ) ) ;
2009-01-06 22:53:26 +03:00
if ( blocks_count > ( ( uint64_t ) 1 < < 32 ) - EXT4_DESC_PER_BLOCK ( sb ) ) {
2020-03-29 00:54:01 +03:00
ext4_msg ( sb , KERN_WARNING , " groups count too large: %llu "
2009-01-06 22:53:26 +03:00
" (block count %llu, first data block %u, "
2020-03-29 00:54:01 +03:00
" blocks per group %lu) " , blocks_count ,
2009-01-06 22:53:26 +03:00
ext4_blocks_count ( es ) ,
le32_to_cpu ( es - > s_first_data_block ) ,
EXT4_BLOCKS_PER_GROUP ( sb ) ) ;
goto failed_mount ;
}
2006-10-11 12:21:10 +04:00
sbi - > s_groups_count = blocks_count ;
ext4: limit block allocations for indirect-block files to < 2^32
Today, the ext4 allocator will happily allocate blocks past
2^32 for indirect-block files, which results in the block
numbers getting truncated, and corruption ensues.
This patch limits such allocations to < 2^32, and adds
BUG_ONs if we do get blocks larger than that.
This should address RH Bug 519471, ext4 bitmap allocator
must limit blocks to < 2^32
* ext4_find_goal() is modified to choose a goal < UINT_MAX,
so that our starting point is in an acceptable range.
* ext4_xattr_block_set() is modified such that the goal block
is < UINT_MAX, as above.
* ext4_mb_regular_allocator() is modified so that the group
search does not continue into groups which are too high
* ext4_mb_use_preallocated() has a check that we don't use
preallocated space which is too far out
* ext4_alloc_blocks() and ext4_xattr_block_set() add some BUG_ONs
No attempt has been made to limit inode locations to < 2^32,
so we may wind up with blocks far from their inodes. Doing
this much already will lead to some odd ENOSPC issues when the
"lower 32" gets full, and further restricting inodes could
make that even weirder.
For high inodes, choosing a goal of the original, % UINT_MAX,
may be a bit odd, but then we're in an odd situation anyway,
and I don't know of a better heuristic.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
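A small user-space sketch of the two ideas above: how a block number beyond 2^32 is silently truncated when stored in a 32-bit block pointer, and how the number of groups usable by indirect-block files can be capped. `MAX_BLOCK_FILE_PHYS`, `truncate_block()` and `blockfile_groups()` are illustrative names, and the limit value 0xFFFFFFFF is an assumption of this sketch.

```c
#include <stdint.h>

/* Assumed highest physical block addressable by a 32-bit block pointer. */
#define MAX_BLOCK_FILE_PHYS 0xFFFFFFFFu

/* Storing a 64-bit block number in a 32-bit field drops the high bits. */
uint32_t truncate_block(uint64_t pblk)
{
	return (uint32_t)pblk;
}

/*
 * Hypothetical helper: groups that indirect-block files may allocate
 * from, i.e. the smaller of the real group count and the number of
 * whole groups that fit below the 32-bit limit.
 */
uint32_t blockfile_groups(uint32_t groups_count, uint32_t blocks_per_group)
{
	uint32_t limit = MAX_BLOCK_FILE_PHYS / blocks_per_group;

	return groups_count < limit ? groups_count : limit;
}
```

For example, block 0x100000005 would be stored as block 5, which is the corruption the patch guards against. With 32768 blocks per group, the cap works out to 131071 groups.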
2009-09-16 22:45:10 +04:00
sbi - > s_blockfile_groups = min_t ( ext4_group_t , sbi - > s_groups_count ,
( EXT4_MAX_BLOCK_FILE_PHYS / EXT4_BLOCKS_PER_GROUP ( sb ) ) ) ;
2018-11-07 18:32:53 +03:00
if ( ( ( u64 ) sbi - > s_groups_count * sbi - > s_inodes_per_group ) ! =
le32_to_cpu ( es - > s_inodes_count ) ) {
ext4_msg ( sb , KERN_ERR , " inodes count not valid: %u vs %llu " ,
le32_to_cpu ( es - > s_inodes_count ) ,
( ( u64 ) sbi - > s_groups_count * sbi - > s_inodes_per_group ) ) ;
ret = - EINVAL ;
goto failed_mount ;
}
2006-10-11 12:20:53 +04:00
db_count = ( sbi - > s_groups_count + EXT4_DESC_PER_BLOCK ( sb ) - 1 ) /
EXT4_DESC_PER_BLOCK ( sb ) ;
ext4: validate s_first_meta_bg at mount time
Ralf Spenneberg reported that he hit a kernel crash when mounting a
modified ext4 image. It turns out that the kernel crashed when
calculating fs overhead (ext4_calculate_overhead()). This is because
the image has a very large s_first_meta_bg (debug code shows it's
842150400), and ext4 overruns the bitmap buffer, which is PAGE_SIZE,
when setting bits in it in count_overhead().
ext4_calculate_overhead():
  buf = get_zeroed_page(GFP_NOFS);  <=== PAGE_SIZE buffer
  blks = count_overhead(sb, i, buf);
count_overhead():
  for (j = ext4_bg_num_gdb(sb, grp); j > 0; j--) { <=== j = 842150400
          ext4_set_bit(EXT4_B2C(sbi, s++), buf);   <=== buffer overrun
          count++;
  }
I can reproduce this easily with the following script:
#!/bin/bash
rm -f fs.img
mkdir -p /mnt/ext4
fallocate -l 16M fs.img
mke2fs -t ext4 -O bigalloc,meta_bg,^resize_inode -F fs.img
debugfs -w -R "ssv first_meta_bg 842150400" fs.img
mount -o loop fs.img /mnt/ext4
Fix it by validating s_first_meta_bg first at mount time, and
refusing to mount if its value exceeds the largest possible meta_bg
number.
Reported-by: Ralf Spenneberg <ralf@os-t.de>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
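The check described above can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: `desc_block_count()` and `first_meta_bg_valid()` are hypothetical helper names, and the comparison mirrors the `> db_count` test in the mount path.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: descriptor blocks needed for `groups` groups when
 * each block holds `desc_per_block` descriptors (ceiling division). The
 * 64-bit intermediate avoids wrap-around for very large group counts. */
uint32_t desc_block_count(uint32_t groups, uint32_t desc_per_block)
{
	return (uint32_t)(((uint64_t)groups + desc_per_block - 1) /
			  desc_per_block);
}

/* Hypothetical helper: an on-disk first_meta_bg value is plausible only
 * if it does not exceed the number of descriptor blocks. */
bool first_meta_bg_valid(uint32_t first_meta_bg, uint32_t db_count)
{
	return first_meta_bg <= db_count;
}
```

For the 16M reproducer image the descriptor block count is tiny, so a value such as 842150400 fails the check long before any bitmap buffer is touched.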
2016-12-01 23:08:37 +03:00
if ( ext4_has_feature_meta_bg ( sb ) ) {
2017-02-15 09:26:39 +03:00
if ( le32_to_cpu ( es - > s_first_meta_bg ) > db_count ) {
2016-12-01 23:08:37 +03:00
ext4_msg ( sb , KERN_WARNING ,
" first meta block group too large: %u "
" (group descriptor block count %u) " ,
le32_to_cpu ( es - > s_first_meta_bg ) , db_count ) ;
goto failed_mount ;
}
}
2020-02-16 00:40:37 +03:00
rcu_assign_pointer ( sbi - > s_group_desc ,
kvmalloc_array ( db_count ,
sizeof ( struct buffer_head * ) ,
GFP_KERNEL ) ) ;
2006-10-11 12:20:50 +04:00
if ( sbi - > s_group_desc = = NULL ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " not enough memory " ) ;
2012-05-29 01:49:54 +04:00
ret = - ENOMEM ;
2006-10-11 12:20:50 +04:00
goto failed_mount ;
}
2009-02-16 02:07:52 +03:00
bgl_lock_init ( sbi - > s_blockgroup_lock ) ;
2006-10-11 12:20:50 +04:00
2017-04-30 07:46:35 +03:00
/* Pre-read the descriptors into the buffer cache */
for ( i = 0 ; i < db_count ; i + + ) {
block = descriptor_loc ( sb , logical_sb_block , i ) ;
2020-09-24 10:33:35 +03:00
ext4_sb_breadahead_unmovable ( sb , block ) ;
2017-04-30 07:46:35 +03:00
}
2006-10-11 12:20:50 +04:00
for ( i = 0 ; i < db_count ; i + + ) {
2020-02-16 00:40:37 +03:00
struct buffer_head * bh ;
2006-10-11 12:21:20 +04:00
block = descriptor_loc ( sb , logical_sb_block , i ) ;
2020-09-24 10:33:37 +03:00
bh = ext4_sb_bread_unmovable ( sb , block ) ;
if ( IS_ERR ( bh ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR ,
" can't read group descriptor %d " , i ) ;
2006-10-11 12:20:50 +04:00
db_count = i ;
2020-09-24 10:33:37 +03:00
ret = PTR_ERR ( bh ) ;
2006-10-11 12:20:50 +04:00
goto failed_mount2 ;
}
2020-02-16 00:40:37 +03:00
rcu_read_lock ( ) ;
rcu_dereference ( sbi - > s_group_desc ) [ i ] = bh ;
rcu_read_unlock ( ) ;
2006-10-11 12:20:50 +04:00
}
2018-07-09 02:35:02 +03:00
sbi - > s_gdb_count = db_count ;
2016-08-01 07:51:02 +03:00
if ( ! ext4_check_descriptors ( sb , logical_sb_block , & first_not_zeroed ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " group descriptors corrupted! " ) ;
2015-10-17 23:16:04 +03:00
ret = - EFSCORRUPTED ;
2014-07-11 21:55:40 +04:00
goto failed_mount2 ;
2006-10-11 12:20:50 +04:00
}
2008-07-12 03:27:31 +04:00
2017-10-18 19:45:17 +03:00
timer_setup ( & sbi - > s_err_report , print_daily_error_info , 0 ) ;
2020-11-27 14:34:00 +03:00
spin_lock_init ( & sbi - > s_error_lock ) ;
INIT_WORK ( & sbi - > s_error_work , flush_stashed_error_work ) ;
2011-04-06 03:55:28 +04:00
2013-04-04 06:10:52 +04:00
/* Register extent status tree shrinker */
2014-09-02 06:26:49 +04:00
if ( ext4_es_register_shrinker ( sbi ) )
2010-11-03 19:03:21 +03:00
goto failed_mount3 ;
2008-01-29 08:19:52 +03:00
sbi - > s_stripe = ext4_get_stripe_size ( sbi ) ;
2012-08-17 17:54:17 +04:00
sbi - > s_extent_max_zeroout_kb = 32 ;
2008-01-29 08:19:52 +03:00
2014-07-11 21:55:40 +04:00
/*
* set up enough so that it can read an inode
*/
2014-09-19 01:12:30 +04:00
sb - > s_op = & ext4_sops ;
2006-10-11 12:20:53 +04:00
sb - > s_export_op = & ext4_export_ops ;
sb - > s_xattr = ext4_xattr_handlers ;
2018-12-12 12:50:12 +03:00
# ifdef CONFIG_FS_ENCRYPTION
2016-07-10 21:01:03 +03:00
sb - > s_cop = & ext4_cryptops ;
2017-10-09 22:15:38 +03:00
# endif
2019-07-22 19:26:24 +03:00
# ifdef CONFIG_FS_VERITY
sb - > s_vop = & ext4_verityops ;
# endif
2006-10-11 12:20:50 +04:00
# ifdef CONFIG_QUOTA
2006-10-11 12:20:53 +04:00
sb - > dq_op = & ext4_quota_operations ;
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_quota ( sb ) )
2014-10-08 20:26:54 +04:00
sb - > s_qcop = & dquot_quotactl_sysfile_ops ;
2013-03-03 02:57:08 +04:00
else
sb - > s_qcop = & ext4_qctl_operations ;
2016-01-09 00:01:22 +03:00
sb - > s_quota_types = QTYPE_MASK_USR | QTYPE_MASK_GRP | QTYPE_MASK_PRJ ;
2006-10-11 12:20:50 +04:00
# endif
2017-05-10 16:06:33 +03:00
memcpy ( & sb - > s_uuid , es - > s_uuid , sizeof ( es - > s_uuid ) ) ;
2011-01-29 16:13:40 +03:00
2006-10-11 12:20:50 +04:00
INIT_LIST_HEAD ( & sbi - > s_orphan ) ; /* unlinked but open files */
2009-04-26 06:54:04 +04:00
mutex_init ( & sbi - > s_orphan_lock ) ;
2006-10-11 12:20:50 +04:00
2020-10-15 23:37:57 +03:00
/* Initialize fast commit stuff */
atomic_set ( & sbi - > s_fc_subtid , 0 ) ;
INIT_LIST_HEAD ( & sbi - > s_fc_q [ FC_Q_MAIN ] ) ;
INIT_LIST_HEAD ( & sbi - > s_fc_q [ FC_Q_STAGING ] ) ;
INIT_LIST_HEAD ( & sbi - > s_fc_dentry_q [ FC_Q_MAIN ] ) ;
INIT_LIST_HEAD ( & sbi - > s_fc_dentry_q [ FC_Q_STAGING ] ) ;
sbi - > s_fc_bytes = 0 ;
2020-11-06 06:59:09 +03:00
ext4_clear_mount_flag ( sb , EXT4_MF_FC_INELIGIBLE ) ;
2022-01-17 12:36:54 +03:00
sbi - > s_fc_ineligible_tid = 0 ;
2020-10-15 23:37:57 +03:00
spin_lock_init ( & sbi - > s_fc_lock ) ;
memset ( & sbi - > s_fc_stats , 0 , sizeof ( sbi - > s_fc_stats ) ) ;
2020-10-15 23:37:59 +03:00
sbi - > s_fc_replay_state . fc_regions = NULL ;
sbi - > s_fc_replay_state . fc_regions_size = 0 ;
sbi - > s_fc_replay_state . fc_regions_used = 0 ;
sbi - > s_fc_replay_state . fc_regions_valid = 0 ;
sbi - > s_fc_replay_state . fc_modified_inodes = NULL ;
sbi - > s_fc_replay_state . fc_modified_inodes_size = 0 ;
sbi - > s_fc_replay_state . fc_modified_inodes_used = 0 ;
2020-10-15 23:37:57 +03:00
2006-10-11 12:20:50 +04:00
sb - > s_root = NULL ;
needs_recovery = ( es - > s_last_orphan ! = 0 | |
2021-08-16 12:57:06 +03:00
ext4_has_feature_orphan_present ( sb ) | |
2015-10-17 23:18:43 +03:00
ext4_has_feature_journal_needs_recovery ( sb ) ) ;
2006-10-11 12:20:50 +04:00
2017-07-17 10:45:34 +03:00
if ( ext4_has_feature_mmp ( sb ) & & ! sb_rdonly ( sb ) )
2011-05-25 02:31:25 +04:00
if ( ext4_multi_mount_protect ( sb , le64_to_cpu ( es - > s_mmp_block ) ) )
2014-10-30 17:53:16 +03:00
goto failed_mount3a ;
2011-05-25 02:31:25 +04:00
2006-10-11 12:20:50 +04:00
/*
* The first inode we look at is the journal inode . Don ' t try
* root first : it may be modified in the journal !
*/
2015-10-17 23:18:43 +03:00
if ( ! test_opt ( sb , NOLOAD ) & & ext4_has_feature_journal ( sb ) ) {
2021-10-27 17:18:53 +03:00
err = ext4_load_journal ( sb , es , ctx - > journal_devnum ) ;
2017-02-05 09:26:48 +03:00
if ( err )
2014-10-30 17:53:16 +03:00
goto failed_mount3a ;
2017-07-17 10:45:34 +03:00
} else if ( test_opt ( sb , NOLOAD ) & & ! sb_rdonly ( sb ) & &
2015-10-17 23:18:43 +03:00
ext4_has_feature_journal_needs_recovery ( sb ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " required journal recovery "
" suppressed and not mounted read-only " ) ;
2010-03-05 00:14:02 +03:00
goto failed_mount_wq ;
2006-10-11 12:20:50 +04:00
} else {
ext4: do not allow journal_opts for fs w/o journal
It turns out that we can pass journal-related mount options, and such options
are then shown in /proc/mounts
Example:
#mkfs.ext4 -F /dev/vdb
#tune2fs -O ^has_journal /dev/vdb
#mount /dev/vdb /mnt/ -ocommit=20,journal_async_commit
#cat /proc/mounts | grep /mnt
/dev/vdb /mnt ext4 rw,relatime,journal_checksum,journal_async_commit,commit=20,data=ordered 0 0
But the options "journal_checksum,journal_async_commit,commit=20,data=ordered"
have nothing to do with reality because there is no journal at all.
This patch disallows the following options for journalless configurations:
- journal_checksum
- journal_async_commit
- commit=%ld
- data={writeback,ordered,journal}
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
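The rejection rule listed above can be sketched as a small userspace function. The option bits below are hypothetical and exist only for this illustration; they are not the kernel's EXT4_MOUNT_* values, and `journalless_conflict()` is an invented name.

```c
#include <stddef.h>

/* Hypothetical bits recording which journal-related options the user
 * passed explicitly; not the kernel's flag values. */
enum {
	OPT_JOURNAL_CHECKSUM     = 1u << 0,
	OPT_JOURNAL_ASYNC_COMMIT = 1u << 1,
	OPT_COMMIT_INTERVAL      = 1u << 2,
	OPT_DATA_MODE            = 1u << 3,
};

/* Returns the name of the first journal-only option that was requested,
 * or NULL if the option set is acceptable for a journalless mount. */
const char *journalless_conflict(unsigned int explicit_opts)
{
	if (explicit_opts & OPT_JOURNAL_CHECKSUM)
		return "journal_checksum";
	if (explicit_opts & OPT_JOURNAL_ASYNC_COMMIT)
		return "journal_async_commit";
	if (explicit_opts & OPT_COMMIT_INTERVAL)
		return "commit";
	if (explicit_opts & OPT_DATA_MODE)
		return "data";
	return NULL;
}
```

A caller would refuse the mount and report the returned name whenever the result is not NULL.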
2015-10-19 06:50:26 +03:00
/* Nojournal mode, all journal mount options are illegal */
if ( test_opt2 ( sb , EXPLICIT_JOURNAL_CHECKSUM ) ) {
ext4_msg ( sb , KERN_ERR , " can't mount with "
" journal_checksum, fs mounted w/o journal " ) ;
goto failed_mount_wq ;
}
if ( test_opt ( sb , JOURNAL_ASYNC_COMMIT ) ) {
ext4_msg ( sb , KERN_ERR , " can't mount with "
" journal_async_commit, fs mounted w/o journal " ) ;
goto failed_mount_wq ;
}
if ( sbi - > s_commit_interval ! = JBD2_DEFAULT_MAX_COMMIT_AGE * HZ ) {
ext4_msg ( sb , KERN_ERR , " can't mount with "
" commit=%lu, fs mounted w/o journal " ,
sbi - > s_commit_interval / HZ ) ;
goto failed_mount_wq ;
}
if ( EXT4_MOUNT_DATA_FLAGS &
( sbi - > s_mount_opt ^ sbi - > s_def_mount_opt ) ) {
ext4_msg ( sb , KERN_ERR , " can't mount with "
" data=, fs mounted w/o journal " ) ;
goto failed_mount_wq ;
}
2019-05-01 06:08:15 +03:00
sbi - > s_def_mount_opt & = ~ EXT4_MOUNT_JOURNAL_CHECKSUM ;
2015-10-19 06:50:26 +03:00
clear_opt ( sb , JOURNAL_CHECKSUM ) ;
2010-12-16 04:26:48 +03:00
clear_opt ( sb , DATA_FLAGS ) ;
2020-10-15 23:37:54 +03:00
clear_opt2 ( sb , JOURNAL_FAST_COMMIT ) ;
2009-01-07 08:06:22 +03:00
sbi - > s_journal = NULL ;
needs_recovery = 0 ;
goto no_journal ;
2006-10-11 12:20:50 +04:00
}
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_64bit ( sb ) & &
2007-07-18 16:37:25 +04:00
! jbd2_journal_set_features ( EXT4_SB ( sb ) - > s_journal , 0 , 0 ,
JBD2_FEATURE_INCOMPAT_64BIT ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " Failed to set 64-bit journal feature " ) ;
2010-03-05 00:14:02 +03:00
goto failed_mount_wq ;
2007-07-18 16:37:25 +04:00
}
2012-05-27 15:48:56 +04:00
if ( ! set_journal_csum_feature_set ( sb ) ) {
ext4_msg ( sb , KERN_ERR , " Failed to set journal checksum "
" feature set " ) ;
goto failed_mount_wq ;
2009-11-02 21:15:27 +03:00
}
2008-01-29 07:58:27 +03:00
2020-11-06 06:58:55 +03:00
if ( test_opt2 ( sb , JOURNAL_FAST_COMMIT ) & &
! jbd2_journal_set_features ( EXT4_SB ( sb ) - > s_journal , 0 , 0 ,
JBD2_FEATURE_INCOMPAT_FAST_COMMIT ) ) {
ext4_msg ( sb , KERN_ERR ,
" Failed to set fast commit journal feature " ) ;
goto failed_mount_wq ;
}
2006-10-11 12:20:50 +04:00
/* We have now updated the journal if required, so we can
* validate the data journaling mode . */
switch ( test_opt ( sb , DATA_FLAGS ) ) {
case 0 :
/* No mode set, assume a default based on the journal
2006-10-11 12:21:24 +04:00
* capabilities : ORDERED_DATA if the journal can
* cope , else JOURNAL_DATA
*/
2006-10-11 12:21:01 +04:00
if ( jbd2_journal_check_available_features
2018-03-30 07:56:10 +03:00
( sbi - > s_journal , 0 , 0 , JBD2_FEATURE_INCOMPAT_REVOKE ) ) {
2010-12-16 04:26:48 +03:00
set_opt ( sb , ORDERED_DATA ) ;
2018-03-30 07:56:10 +03:00
sbi - > s_def_mount_opt | = EXT4_MOUNT_ORDERED_DATA ;
} else {
2010-12-16 04:26:48 +03:00
set_opt ( sb , JOURNAL_DATA ) ;
2018-03-30 07:56:10 +03:00
sbi - > s_def_mount_opt | = EXT4_MOUNT_JOURNAL_DATA ;
}
2006-10-11 12:20:50 +04:00
break ;
2006-10-11 12:20:53 +04:00
case EXT4_MOUNT_ORDERED_DATA :
case EXT4_MOUNT_WRITEBACK_DATA :
2006-10-11 12:21:01 +04:00
if ( ! jbd2_journal_check_available_features
( sbi - > s_journal , 0 , 0 , JBD2_FEATURE_INCOMPAT_REVOKE ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " Journal does not support "
" requested data journaling mode " ) ;
2010-03-05 00:14:02 +03:00
goto failed_mount_wq ;
2006-10-11 12:20:50 +04:00
}
2020-11-20 21:28:32 +03:00
break ;
2006-10-11 12:20:50 +04:00
default :
break ;
}
2016-12-04 00:20:53 +03:00
if ( test_opt ( sb , DATA_FLAGS ) = = EXT4_MOUNT_ORDERED_DATA & &
test_opt ( sb , JOURNAL_ASYNC_COMMIT ) ) {
ext4_msg ( sb , KERN_ERR , " can't mount with "
" journal_async_commit in data=ordered mode " ) ;
goto failed_mount_wq ;
}
2021-10-27 17:18:53 +03:00
set_task_ioprio ( sbi - > s_journal - > j_task , ctx - > journal_ioprio ) ;
2006-10-11 12:20:50 +04:00
2020-10-06 03:48:39 +03:00
sbi - > s_journal - > j_submit_inode_data_buffers =
2020-10-06 03:48:41 +03:00
ext4_journal_submit_inode_data_buffers ;
2020-10-06 03:48:39 +03:00
sbi - > s_journal - > j_finish_inode_data_buffers =
2020-10-06 03:48:41 +03:00
ext4_journal_finish_inode_data_buffers ;
2012-02-21 02:53:02 +04:00
2010-11-03 19:03:21 +03:00
no_journal :
2017-06-22 18:55:14 +03:00
if ( ! test_opt ( sb , NO_MBCACHE ) ) {
sbi - > s_ea_block_cache = ext4_xattr_create_cache ( ) ;
if ( ! sbi - > s_ea_block_cache ) {
2017-06-22 18:44:55 +03:00
ext4_msg ( sb , KERN_ERR ,
2017-06-22 18:55:14 +03:00
" Failed to create ea_block_cache " ) ;
2017-06-22 18:44:55 +03:00
goto failed_mount_wq ;
}
2017-06-22 18:55:14 +03:00
if ( ext4_has_feature_ea_inode ( sb ) ) {
sbi - > s_ea_inode_cache = ext4_xattr_create_cache ( ) ;
if ( ! sbi - > s_ea_inode_cache ) {
ext4_msg ( sb , KERN_ERR ,
" Failed to create ea_inode_cache " ) ;
goto failed_mount_wq ;
}
}
2014-03-19 03:24:49 +04:00
}
2019-07-22 19:26:24 +03:00
if ( ext4_has_feature_verity ( sb ) & & blocksize ! = PAGE_SIZE ) {
ext4_msg ( sb , KERN_ERR , " Unsupported blocksize for fs-verity " ) ;
goto failed_mount_wq ;
}
2012-07-10 00:27:05 +04:00
/*
* Get the # of file system overhead blocks from the
* superblock if present .
*/
2022-04-15 04:57:49 +03:00
sbi - > s_overhead = le32_to_cpu ( es - > s_overhead_clusters ) ;
/* ignore the precalculated value if it is ridiculous */
if ( sbi - > s_overhead > ext4_blocks_count ( es ) )
sbi - > s_overhead = 0 ;
/*
* If the bigalloc feature is not enabled recalculating the
* overhead doesn ' t take long , so we might as well just redo
* it to make sure we are using the correct value .
*/
if ( ! ext4_has_feature_bigalloc ( sb ) )
sbi - > s_overhead = 0 ;
if ( sbi - > s_overhead = = 0 ) {
2012-11-09 00:16:54 +04:00
err = ext4_calculate_overhead ( sb ) ;
if ( err )
2012-07-10 00:27:05 +04:00
goto failed_mount_wq ;
}
2011-02-01 13:42:42 +03:00
/*
* The maximum number of concurrent works can be high and
* concurrency isn ' t really necessary . Limit it to 1.
*/
2013-06-04 22:21:02 +04:00
EXT4_SB ( sb ) - > rsv_conversion_wq =
alloc_workqueue ( " ext4-rsv-conversion " , WQ_MEM_RECLAIM | WQ_UNBOUND , 1 ) ;
if ( ! EXT4_SB ( sb ) - > rsv_conversion_wq ) {
printk ( KERN_ERR " EXT4-fs: failed to create workqueue \n " ) ;
2012-11-09 00:16:54 +04:00
ret = - ENOMEM ;
2013-06-04 22:21:02 +04:00
goto failed_mount4 ;
}
2006-10-11 12:20:50 +04:00
/*
2006-10-11 12:21:01 +04:00
* The jbd2_journal_load will have done any necessary log recovery ,
2006-10-11 12:20:50 +04:00
* so we can safely mount the rest of the filesystem now .
*/
ext4: avoid declaring fs inconsistent due to invalid file handles
If we receive a file handle, either from NFS or open_by_handle_at(2),
and it points at an inode which has not been initialized, and the file
system has metadata checksums enabled, we shouldn't try to get the
inode, discover the checksum is invalid, and then declare the file
system as being inconsistent.
This can be reproduced by creating a test file system via "mke2fs -t
ext4 -O metadata_csum /tmp/foo.img 8M", mounting it, cd'ing into that
directory, and then running the following program.
#define _GNU_SOURCE
#include <fcntl.h>
struct handle {
struct file_handle fh;
unsigned char fid[MAX_HANDLE_SZ];
};
int main(int argc, char **argv)
{
struct handle h = {{8, 1 }, { 12, }};
open_by_handle_at(AT_FDCWD, &h.fh, O_RDONLY);
return 0;
}
Google-Bug-Id: 120690101
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2018-12-19 20:29:13 +03:00
root = ext4_iget ( sb , EXT4_ROOT_INO , EXT4_IGET_SPECIAL ) ;
2008-02-07 11:15:37 +03:00
if ( IS_ERR ( root ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " get root inode failed " ) ;
2008-02-07 11:15:37 +03:00
ret = PTR_ERR ( root ) ;
2011-02-28 04:42:06 +03:00
root = NULL ;
2006-10-11 12:20:50 +04:00
goto failed_mount4 ;
}
if ( ! S_ISDIR ( root - > i_mode ) | | ! root - > i_blocks | | ! root - > i_size ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " corrupt root inode, run e2fsck " ) ;
2012-01-10 00:53:24 +04:00
iput ( root ) ;
2006-10-11 12:20:50 +04:00
goto failed_mount4 ;
}
ext4: Support case-insensitive file name lookups
This patch implements the actual support for case-insensitive file name
lookups in ext4, based on the feature bit and the encoding stored in the
superblock.
A filesystem that has the casefold feature set is able to configure
directories with the +F (EXT4_CASEFOLD_FL) attribute, enabling lookups
to succeed in that directory in a case-insensitive fashion, i.e: match
a directory entry even if the name used by userspace is not a byte per
byte match with the disk name, but is an equivalent case-insensitive
version of the Unicode string. This operation is called a
case-insensitive file name lookup.
The feature is configured as an inode attribute applied to directories
and inherited by its children. This attribute can only be enabled on
empty directories for filesystems that support the encoding feature,
thus preventing collision of file names that only differ by case.
* dcache handling:
For a +F directory, Ext4 only stores the first equivalent name dentry
used in the dcache. This is done to prevent unintentional duplication of
dentries in the dcache, while also allowing the VFS code to quickly find
the right entry in the cache regardless of which equivalent string was used in
a previous lookup, without having to resort to ->lookup().
d_hash() of casefolded directories is implemented as the hash of the
casefolded string, such that we always have a well-known bucket for all
the equivalencies of the same string. d_compare() uses the
utf8_strncasecmp() infrastructure, which handles the comparison of
equivalent, same case, names as well.
For now, negative lookups are not inserted in the dcache, since they
would need to be invalidated anyway, because we can't trust missing file
dentries. This is bad for performance but requires some leveraging of
the vfs layer to fix. We can live without that for now, and so does
everyone else.
* on-disk data:
Despite using a specific version of the name as the internal
representation within the dcache, the name stored and fetched from the
disk is a byte-per-byte match with what the user requested, making this
implementation 'name-preserving'. i.e. no actual information is lost
when writing to storage.
DX is supported by modifying the hashes used in +F directories to make
them case/encoding-aware. The new disk hashes are calculated as the
hash of the full casefolded string, instead of the string directly.
This allows us to efficiently search for file names in the htree without
requiring the user to provide an exact name.
* Dealing with invalid sequences:
By default, when an invalid UTF-8 sequence is identified, ext4 will treat
it as an opaque byte sequence, ignoring the encoding and reverting to
the old behavior for that file alone. This means that case-insensitive
file name lookup will not work for that one file. An optional bit can
be set in the superblock telling the filesystem code and userspace tools
to enforce the encoding. When that optional bit is set, any attempt to
create a file name using an invalid UTF-8 sequence will fail and return
an error to userspace.
* Normalization algorithm:
The UTF-8 algorithm used to compare strings in ext4 lives in
fs/unicode and is based on a previous version developed by
SGI. It implements the Canonical decomposition (NFD) algorithm
described by the Unicode specification 12.1, or higher, combined with
the elimination of ignorable code points (NFDi) and full
case-folding (CF) as documented in fs/unicode/utf8_norm.c.
NFD seems to be the best normalization method for EXT4 because:
- It has a lower cost than NFC/NFKC (which requires
decomposing to NFD as an intermediary step)
- It doesn't eliminate important semantic meaning like
compatibility decompositions.
Although:
- This implementation is not completely linguistically accurate, because
different languages have conflicting rules, which would require the
specialization of the filesystem to a given locale, which brings all
sorts of problems for removable media and for users who use more than
one language.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
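The dcache idea described above (hash the casefolded string so that all equivalent names share a bucket, then compare case-insensitively) can be illustrated with a deliberately simplified sketch. The assumptions are significant: folding here is ASCII-only via `tolower()` instead of the NFD-based Unicode folding from fs/unicode, and FNV-1a is swapped in for the kernel's dentry hash. All function names are hypothetical.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdint.h>

/* ASCII-only stand-in for full Unicode case folding (assumption). */
unsigned char fold_byte(unsigned char c)
{
	return (unsigned char)tolower(c);
}

/* FNV-1a over the folded bytes, so names that differ only by case land
 * in the same hash bucket. Not the kernel's hash function. */
uint32_t casefold_hash(const char *name)
{
	uint32_t h = 2166136261u;
	const unsigned char *p;

	for (p = (const unsigned char *)name; *p; p++) {
		h ^= fold_byte(*p);
		h *= 16777619u;
	}
	return h;
}

/* Equality under the same folding; plays the role of d_compare(). */
bool casefold_equal(const char *a, const char *b)
{
	const unsigned char *p = (const unsigned char *)a;
	const unsigned char *q = (const unsigned char *)b;

	while (*p && *q) {
		if (fold_byte(*p) != fold_byte(*q))
			return false;
		p++;
		q++;
	}
	return *p == *q;	/* both strings must end together */
}
```

The name written to disk is left untouched in this scheme; only the hash and the comparison see the folded form, which matches the name-preserving behaviour described above.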
2019-04-25 21:12:08 +03:00
2012-01-09 07:15:13 +04:00
sb - > s_root = d_make_root ( root ) ;
2008-02-07 11:15:37 +03:00
if ( ! sb - > s_root ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " get root dentry failed " ) ;
2008-02-07 11:15:37 +03:00
ret = - ENOMEM ;
goto failed_mount4 ;
}
2006-10-11 12:20:50 +04:00
2018-05-14 06:02:19 +03:00
ret = ext4_setup_super ( sb , es , sb_rdonly ( sb ) ) ;
if ( ret = = - EROFS ) {
2017-11-28 00:05:09 +03:00
sb - > s_flags | = SB_RDONLY ;
2018-05-14 06:02:19 +03:00
ret = 0 ;
} else if ( ret )
goto failed_mount4a ;
2007-07-18 17:15:20 +04:00
2015-09-23 19:44:17 +03:00
ext4_set_resv_clusters ( sb ) ;
2013-04-10 06:11:22 +04:00
2020-07-28 16:04:37 +03:00
if ( test_opt ( sb , BLOCK_VALIDITY ) ) {
err = ext4_setup_system_zone ( sb ) ;
if ( err ) {
ext4_msg ( sb , KERN_ERR , " failed to initialize system "
" zone (%d) " , err ) ;
goto failed_mount4a ;
}
2014-07-11 21:55:40 +04:00
}
2020-10-15 23:37:59 +03:00
ext4_fc_replay_cleanup ( sb ) ;
2014-07-11 21:55:40 +04:00
ext4_ext_init ( sb ) ;
ext4: improve cr 0 / cr 1 group scanning
Instead of traversing through groups linearly, scan groups in specific
orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
largest free order >= the order of the request. So, with this patch,
we maintain lists for each possible order and insert each group into a
list based on the largest free order in its buddy bitmap. During cr 0
allocation, we traverse these lists in the increasing order of largest
free orders. This allows us to find a group with the best available cr
0 match in constant time. If nothing can be found, we fall back to cr 1
immediately.
At CR1, the story is slightly different. We want to traverse in the
order of increasing average fragment size. For CR1, we maintain a rb
tree of groupinfos which is sorted by average fragment size. Instead
of traversing linearly, at CR1, we traverse in the order of increasing
average fragment size, starting at the most optimal group. This brings
down cr 1 search complexity to log(num groups).
For cr >= 2, we just perform the linear search as before. Also, in
case of lock contention, we intermittently fall back to linear search
even in CR 0 and CR 1 cases. This allows us to proceed during the
allocation path even in case of high contention.
There is an opportunity to do optimization at CR2 too. That's because
at CR2 we only consider groups where bb_free counter (number of free
blocks) is greater than the request extent size. That's left as future
work.
All the changes introduced in this patch are protected under a new
mount option "mb_optimize_scan".
With this patchset, the following experiment was performed:
A highly fragmented disk of size 65TB was created. The disk had no
contiguous 2M regions. The following command was run 3 times
consecutively:
time dd if=/dev/urandom of=file bs=2M count=10
Here are the results with and without cr 0/1 optimizations introduced
in this patch:
|---------+------------------------------+---------------------------|
| | Without CR 0/1 Optimizations | With CR 0/1 Optimizations |
|---------+------------------------------+---------------------------|
| 1st run | 5m1.871s | 2m47.642s |
| 2nd run | 2m28.390s | 0m0.611s |
| 3rd run | 2m26.530s | 0m1.255s |
|---------+------------------------------+---------------------------|
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20210401172129.189766-6-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
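The cr 0 lookup described above can be sketched with one list head per order. This is a simplified model under stated assumptions: `MB_SKETCH_ORDERS` is an arbitrary bound chosen for the example, each "list" is reduced to a single group number (or -1 when empty), and `pick_group_cr0()` is a hypothetical name.

```c
/* Hypothetical number of buddy orders tracked in this sketch. */
#define MB_SKETCH_ORDERS 14

/* list_head[o] holds a group whose largest free order is o, or -1 when
 * no group currently has that largest free order. Walks the lists in
 * increasing order starting at the request order; the walk is bounded
 * by the number of orders, not by the number of groups. */
int pick_group_cr0(const int list_head[MB_SKETCH_ORDERS], int request_order)
{
	int o;

	if (request_order < 0)
		request_order = 0;
	for (o = request_order; o < MB_SKETCH_ORDERS; o++)
		if (list_head[o] >= 0)
			return list_head[o];
	return -1;	/* nothing suitable: caller falls back to cr 1 */
}
```

Because the loop length depends only on the fixed number of orders, the cost stays constant as the filesystem grows, which is the property the commit message relies on.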
2021-04-01 20:21:27 +03:00
/*
* Enable optimize_scan if number of groups is > threshold . This can be
* turned off by passing " mb_optimize_scan=0 " . This can also be
* turned on forcefully by passing " mb_optimize_scan=1 " .
*/
2022-03-08 12:52:00 +03:00
if ( ! ( ctx - > spec & EXT4_SPEC_mb_optimize_scan ) ) {
if ( sbi - > s_groups_count > = MB_DEFAULT_LINEAR_SCAN_THRESHOLD )
set_opt2 ( sb , MB_OPTIMIZE_SCAN ) ;
else
clear_opt2 ( sb , MB_OPTIMIZE_SCAN ) ;
}
2021-04-01 20:21:27 +03:00
2014-07-11 21:55:40 +04:00
err = ext4_mb_init ( sb ) ;
if ( err ) {
ext4_msg ( sb , KERN_ERR , " failed to initialize mballoc (%d) " ,
err ) ;
2011-10-06 20:10:11 +04:00
goto failed_mount5 ;
2008-10-11 04:07:20 +04:00
}
2021-01-21 20:33:20 +03:00
/*
* We can only set up the journal commit callback once
* mballoc is initialized
*/
if ( sbi - > s_journal )
sbi - > s_journal - > j_commit_callback =
ext4_journal_commit_callback ;
2014-07-15 14:01:38 +04:00
block = ext4_count_free_clusters ( sb ) ;
2021-04-09 07:20:35 +03:00
ext4_free_blocks_count_set ( sbi - > s_es ,
2014-07-15 14:01:38 +04:00
EXT4_C2B ( sbi , block ) ) ;
2014-09-08 04:51:29 +04:00
err = percpu_counter_init ( & sbi - > s_freeclusters_counter , block ,
GFP_KERNEL ) ;
2014-07-15 14:01:38 +04:00
if ( ! err ) {
unsigned long freei = ext4_count_free_inodes ( sb ) ;
sbi - > s_es - > s_free_inodes_count = cpu_to_le32 ( freei ) ;
2014-09-08 04:51:29 +04:00
err = percpu_counter_init ( & sbi - > s_freeinodes_counter , freei ,
GFP_KERNEL ) ;
2014-07-15 14:01:38 +04:00
}
if ( ! err )
err = percpu_counter_init ( & sbi - > s_dirs_counter ,
2014-09-08 04:51:29 +04:00
ext4_count_dirs ( sb ) , GFP_KERNEL ) ;
2014-07-15 14:01:38 +04:00
if ( ! err )
2014-09-08 04:51:29 +04:00
err = percpu_counter_init ( & sbi - > s_dirtyclusters_counter , 0 ,
GFP_KERNEL ) ;
2021-02-18 18:11:32 +03:00
if ( ! err )
err = percpu_counter_init ( & sbi - > s_sra_exceeded_retry_limit , 0 ,
GFP_KERNEL ) ;
2016-04-26 06:22:35 +03:00
if ( ! err )
2020-02-19 21:30:46 +03:00
err = percpu_init_rwsem ( & sbi - > s_writepages_rwsem ) ;
2016-04-26 06:22:35 +03:00
2014-07-15 14:01:38 +04:00
if ( err ) {
ext4_msg ( sb , KERN_ERR , " insufficient memory " ) ;
goto failed_mount6 ;
}
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_flex_bg ( sb ) )
2014-07-15 14:01:38 +04:00
if ( ! ext4_fill_flex_info ( sb ) ) {
ext4_msg ( sb , KERN_ERR ,
" unable to initialize "
" flex_bg meta info! " ) ;
2021-05-10 14:10:51 +03:00
ret = - ENOMEM ;
2014-07-15 14:01:38 +04:00
goto failed_mount6 ;
}
ext4: add support for lazy inode table initialization
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out. The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, have not been corrupted. However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the old, uninitialized data could potentially cause e2fsck to
report false problems.
Hence, it is important for the inode tables to be initialized as soon
as possible. This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.
This is done via a new kernel thread called ext4lazyinit, which is
created on demand and destroyed when it is no longer needed. There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with the init_itable mount option is mounted, the
ext4lazyinit thread is created, and the filesystem can then register its request in the
request list.
This thread then walks through the list of requests picking up
scheduled requests and invoking ext4_init_inode_table(). Next schedule
time for the request is computed by multiplying the time it took to
zero out last inode table with wait multiplier, which can be set with
the (init_itable=n) mount option (default is 10). We are doing
this so we do not take the whole I/O bandwidth. When the thread is no
longer necessary (request list is empty) it frees the appropriate
structures and exits (and can be created later by another
filesystem).
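The back-off rule described above can be sketched in user-space C. This is a minimal illustration under stated assumptions, not the kernel code: the function name next_schedule, its parameters, and the use of plain tick counts in place of jiffies are all hypothetical.

```c
/*
 * Hypothetical helper (not a kernel function): given the current time,
 * the time the last inode-table zeroing took, and the wait multiplier
 * (the init_itable=n mount option, default 10), return the tick at
 * which the next request should run. All values share one tick unit.
 */
static unsigned long next_schedule(unsigned long now,
				   unsigned long elapsed,
				   unsigned long wait_mult)
{
	/* The slower the last zeroing pass, the longer the thread waits. */
	return now + elapsed * wait_mult;
}
```

For instance, if zeroing one inode table took 3 ticks and finished at tick 100, the default multiplier of 10 would place the next pass at tick 130, so the thread spends roughly one part in eleven of its time doing I/O.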
We do not disturb regular inode allocations in any way; the allocator simply
does not care whether the inode table is zeroed or not. But when zeroing, we
have to skip used inodes, obviously. We should also prevent new inode
allocations from the group while zeroing is under way. For that we
take write alloc_sem lock in ext4_init_inode_table() and read alloc_sem
in the ext4_claim_inode, so when we are unlucky and allocator hits the
group which is currently being zeroed, it just has to wait.
This can be suppressed using the mount option no_init_itable.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28 05:30:05 +04:00
err = ext4_register_li_request ( sb , first_not_zeroed ) ;
if ( err )
2011-10-06 20:10:11 +04:00
goto failed_mount6 ;
2010-10-28 05:30:05 +04:00
2015-09-23 19:44:17 +03:00
err = ext4_register_sysfs ( sb ) ;
2011-10-06 20:10:11 +04:00
if ( err )
goto failed_mount7 ;
2009-03-31 17:10:09 +04:00
2021-08-16 12:57:06 +03:00
err = ext4_init_orphan_info ( sb ) ;
if ( err )
goto failed_mount8 ;
2013-03-03 03:22:38 +04:00
# ifdef CONFIG_QUOTA
/* Enable quota usage during mount. */
2017-07-17 10:45:34 +03:00
if ( ext4_has_feature_quota ( sb ) & & ! sb_rdonly ( sb ) ) {
2013-03-03 03:22:38 +04:00
err = ext4_enable_quotas ( sb ) ;
if ( err )
2021-08-16 12:57:06 +03:00
goto failed_mount9 ;
2013-03-03 03:22:38 +04:00
}
# endif /* CONFIG_QUOTA */
2020-06-20 05:54:23 +03:00
/*
* Save the original bdev mapping's wb_err value which could be
* used to detect the metadata async write error.
*/
spin_lock_init ( & sbi - > s_bdev_wb_lock ) ;
2020-09-28 05:05:56 +03:00
errseq_check_and_advance ( & sb - > s_bdev - > bd_inode - > i_mapping - > wb_err ,
& sbi - > s_bdev_wb_err ) ;
2020-06-20 05:54:23 +03:00
sb - > s_bdev - > bd_super = sb ;
2006-10-11 12:20:53 +04:00
EXT4_SB ( sb ) - > s_mount_state | = EXT4_ORPHAN_FS ;
ext4_orphan_cleanup ( sb , es ) ;
EXT4_SB ( sb ) - > s_mount_state & = ~ EXT4_ORPHAN_FS ;
2022-05-25 04:29:04 +03:00
/*
* Update the checksum after updating free space/inode counters and
* ext4_orphan_cleanup. Otherwise the superblock can have an incorrect
* checksum in the buffer cache until it is written out and
* e2fsprogs programs trying to open a file system immediately
* after it is mounted can fail.
*/
ext4_superblock_csum_set ( sb ) ;
2009-01-07 08:06:22 +03:00
if ( needs_recovery ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_INFO , " recovery complete " ) ;
2020-07-10 17:07:59 +03:00
err = ext4_mark_recovery_complete ( sb , es ) ;
if ( err )
2021-08-16 12:57:06 +03:00
goto failed_mount9 ;
2009-01-07 08:06:22 +03:00
}
2022-04-15 07:52:55 +03:00
if ( test_opt ( sb , DISCARD ) & & ! bdev_max_discard_sectors ( sb - > s_bdev ) )
ext4_msg ( sb , KERN_WARNING ,
" mounting with \" discard \" option, but the device does not support discard " ) ;
2012-11-08 22:28:29 +04:00
2010-07-27 19:56:04 +04:00
if ( es - > s_error_count )
mod_timer ( & sbi - > s_err_report , jiffies + 300 * HZ ) ; /* 5 minutes */
2006-10-11 12:20:50 +04:00
2013-10-18 05:11:01 +04:00
/* Enable message ratelimiting. Default is 10 messages per 5 secs. */
ratelimit_state_init ( & sbi - > s_err_ratelimit_state , 5 * HZ , 10 ) ;
ratelimit_state_init ( & sbi - > s_warning_ratelimit_state , 5 * HZ , 10 ) ;
ratelimit_state_init ( & sbi - > s_msg_ratelimit_state , 5 * HZ , 10 ) ;
2020-07-25 15:33:13 +03:00
atomic_set ( & sbi - > s_warning_count , 0 ) ;
atomic_set ( & sbi - > s_msg_count , 0 ) ;
2013-10-18 05:11:01 +04:00
2006-10-11 12:20:50 +04:00
return 0 ;
2006-10-11 12:20:53 +04:00
cantfind_ext4 :
2006-10-11 12:20:50 +04:00
if ( ! silent )
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " VFS: Can't find ext4 filesystem " ) ;
2006-10-11 12:20:50 +04:00
goto failed_mount ;
2021-08-16 12:57:06 +03:00
failed_mount9 :
ext4_release_orphan_info ( sb ) ;
2013-01-25 08:24:54 +04:00
failed_mount8 :
2015-09-23 19:46:17 +03:00
ext4_unregister_sysfs ( sb ) ;
2020-09-22 19:24:56 +03:00
kobject_put ( & sbi - > s_kobj ) ;
2011-10-06 20:10:11 +04:00
failed_mount7 :
ext4_unregister_li_request ( sb ) ;
failed_mount6 :
2014-07-11 21:55:40 +04:00
ext4_mb_release ( sb ) ;
2020-02-19 06:08:51 +03:00
rcu_read_lock ( ) ;
flex_groups = rcu_dereference ( sbi - > s_flex_groups ) ;
if ( flex_groups ) {
for ( i = 0 ; i < sbi - > s_flex_groups_allocated ; i + + )
kvfree ( flex_groups [ i ] ) ;
kvfree ( flex_groups ) ;
}
rcu_read_unlock ( ) ;
2014-07-15 14:01:38 +04:00
percpu_counter_destroy ( & sbi - > s_freeclusters_counter ) ;
percpu_counter_destroy ( & sbi - > s_freeinodes_counter ) ;
percpu_counter_destroy ( & sbi - > s_dirs_counter ) ;
percpu_counter_destroy ( & sbi - > s_dirtyclusters_counter ) ;
2021-02-18 18:11:32 +03:00
percpu_counter_destroy ( & sbi - > s_sra_exceeded_retry_limit ) ;
2020-02-19 21:30:46 +03:00
percpu_free_rwsem ( & sbi - > s_writepages_rwsem ) ;
ext4: initialize multi-block allocator before checking block descriptors
With EXT4FS_DEBUG ext4_count_free_clusters() will call
ext4_read_block_bitmap() without s_group_info initialized, so we need to
initialize the multi-block allocator first.
The following dependencies must be resolved to allow this:
- the multi-block allocator needs the group descriptors
- need to install s_op before initializing multi-block allocator,
because in ext4_mb_init_backend() new inode is created.
- initialize number of group desc blocks (s_gdb_count) otherwise
number of clusters returned by ext4_free_clusters_after_init() is not correct.
(see ext4_bg_num_gdb_nometa())
Here is the stack backtrace:
(gdb) bt
#0 ext4_get_group_info (group=0, sb=0xffff880079a10000) at ext4.h:2430
#1 ext4_validate_block_bitmap (sb=sb@entry=0xffff880079a10000,
desc=desc@entry=0xffff880056510000, block_group=block_group@entry=0,
bh=bh@entry=0xffff88007bf2b2d8) at balloc.c:358
#2 0xffffffff81232202 in ext4_wait_block_bitmap (sb=sb@entry=0xffff880079a10000,
block_group=block_group@entry=0,
bh=bh@entry=0xffff88007bf2b2d8) at balloc.c:476
#3 0xffffffff81232eaf in ext4_read_block_bitmap (sb=sb@entry=0xffff880079a10000,
block_group=block_group@entry=0) at balloc.c:489
#4 0xffffffff81232fc0 in ext4_count_free_clusters (sb=sb@entry=0xffff880079a10000) at balloc.c:665
#5 0xffffffff81259ffa in ext4_check_descriptors (first_not_zeroed=<synthetic pointer>,
sb=0xffff880079a10000) at super.c:2143
#6 ext4_fill_super (sb=sb@entry=0xffff880079a10000, data=<optimized out>,
data@entry=0x0 <irq_stack_union>, silent=silent@entry=0) at super.c:3851
...
Signed-off-by: Azat Khuzhin <a3at.mail@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-07 18:54:20 +04:00
failed_mount5 :
2014-07-11 21:55:40 +04:00
ext4_ext_release ( sb ) ;
ext4_release_system_zone ( sb ) ;
failed_mount4a :
2012-01-10 00:53:24 +04:00
dput ( sb - > s_root ) ;
2011-02-28 04:42:06 +03:00
sb - > s_root = NULL ;
2012-01-10 00:53:24 +04:00
failed_mount4 :
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " mount failed " ) ;
2013-06-04 22:21:02 +04:00
if ( EXT4_SB ( sb ) - > rsv_conversion_wq )
destroy_workqueue ( EXT4_SB ( sb ) - > rsv_conversion_wq ) ;
2009-09-28 23:48:41 +04:00
failed_mount_wq :
2018-12-04 08:24:42 +03:00
ext4_xattr_destroy_cache ( sbi - > s_ea_inode_cache ) ;
sbi - > s_ea_inode_cache = NULL ;
ext4_xattr_destroy_cache ( sbi - > s_ea_block_cache ) ;
sbi - > s_ea_block_cache = NULL ;
2009-01-07 08:06:22 +03:00
if ( sbi - > s_journal ) {
2021-09-24 12:39:17 +03:00
/* flush s_error_work before journal destroy. */
flush_work ( & sbi - > s_error_work ) ;
2009-01-07 08:06:22 +03:00
jbd2_journal_destroy ( sbi - > s_journal ) ;
sbi - > s_journal = NULL ;
}
2014-10-30 17:53:16 +03:00
failed_mount3a :
2013-07-01 16:12:37 +04:00
ext4_es_unregister_shrinker ( sbi ) ;
2014-09-02 06:26:49 +04:00
failed_mount3 :
2021-09-24 12:39:17 +03:00
/* flush s_error_work before sbi destroy */
2020-11-27 14:34:00 +03:00
flush_work ( & sbi - > s_error_work ) ;
2021-03-15 19:59:06 +03:00
del_timer_sync ( & sbi - > s_err_report ) ;
2021-04-30 21:50:46 +03:00
ext4_stop_mmpd ( sbi ) ;
2006-10-11 12:20:50 +04:00
failed_mount2 :
2020-02-16 00:40:37 +03:00
rcu_read_lock ( ) ;
group_desc = rcu_dereference ( sbi - > s_group_desc ) ;
2006-10-11 12:20:50 +04:00
for ( i = 0 ; i < db_count ; i + + )
2020-02-16 00:40:37 +03:00
brelse ( group_desc [ i ] ) ;
kvfree ( group_desc ) ;
rcu_read_unlock ( ) ;
2006-10-11 12:20:50 +04:00
failed_mount :
2012-04-30 02:27:10 +04:00
if ( sbi - > s_chksum_driver )
crypto_free_shash ( sbi - > s_chksum_driver ) ;
2019-04-25 21:05:42 +03:00
2022-01-18 09:56:14 +03:00
# if IS_ENABLED(CONFIG_UNICODE)
2020-10-28 08:08:20 +03:00
utf8_unload ( sb - > s_encoding ) ;
2019-04-25 21:05:42 +03:00
# endif
2006-10-11 12:20:50 +04:00
# ifdef CONFIG_QUOTA
2014-09-11 19:15:15 +04:00
for ( i = 0 ; i < EXT4_MAXQUOTAS ; i + + )
2019-05-12 11:49:47 +03:00
kfree ( get_qf_name ( sb , sbi , i ) ) ;
2006-10-11 12:20:50 +04:00
# endif
fscrypt: handle test_dummy_encryption in more logical way
The behavior of the test_dummy_encryption mount option is that when a
new file (or directory or symlink) is created in an unencrypted
directory, it's automatically encrypted using a dummy encryption policy.
That's it; in particular, the encryption (or lack thereof) of existing
files (or directories or symlinks) doesn't change.
Unfortunately the implementation of test_dummy_encryption is a bit weird
and confusing. When test_dummy_encryption is enabled and a file is
being created in an unencrypted directory, we set up an encryption key
(->i_crypt_info) for the directory. This isn't actually used to do any
encryption, however, since the directory is still unencrypted! Instead,
->i_crypt_info is only used for inheriting the encryption policy.
One consequence of this is that the filesystem ends up providing a
"dummy context" (policy + nonce) instead of a "dummy policy". In
commit ed318a6cc0b6 ("fscrypt: support test_dummy_encryption=v2"), I
mistakenly thought this was required. However, actually the nonce only
ends up being used to derive a key that is never used.
Another consequence of this implementation is that it allows for
'inode->i_crypt_info != NULL && !IS_ENCRYPTED(inode)', which is an edge
case that can be forgotten about. For example, currently
FS_IOC_GET_ENCRYPTION_POLICY on an unencrypted directory may return the
dummy encryption policy when the filesystem is mounted with
test_dummy_encryption. That seems like the wrong thing to do, since
again, the directory itself is not actually encrypted.
Therefore, switch to a more logical and maintainable implementation
where the dummy encryption policy inheritance is done without setting up
keys for unencrypted directories. This involves:
- Adding a function fscrypt_policy_to_inherit() which returns the
encryption policy to inherit from a directory. This can be a real
policy, a dummy policy, or no policy.
- Replacing struct fscrypt_dummy_context, ->get_dummy_context(), etc.
with struct fscrypt_dummy_policy, ->get_dummy_policy(), etc.
- Making fscrypt_fname_encrypted_size() take an fscrypt_policy instead
of an inode.
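The three possible outcomes of the inheritance lookup described above (a real policy, a dummy policy, or no policy) can be modelled with a small user-space sketch. Everything here is an assumption made for illustration: the struct names, fields, and the function policy_to_inherit are simplified stand-ins, not the fscrypt API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct policy {
	int version;
};

struct superblock_model {
	/* NULL unless the filesystem was mounted with test_dummy_encryption */
	const struct policy *dummy_policy;
};

struct dir_model {
	bool encrypted;				/* models IS_ENCRYPTED(dir) */
	const struct policy *own_policy;	/* meaningful only when encrypted */
	const struct superblock_model *sb;
};

/*
 * Return the policy a new file created in dir should inherit:
 * the directory's real policy if it is encrypted, otherwise the
 * filesystem's dummy policy, which may be NULL (no policy).
 * No key is set up for an unencrypted directory.
 */
static const struct policy *policy_to_inherit(const struct dir_model *dir)
{
	if (dir->encrypted)
		return dir->own_policy;
	return dir->sb->dummy_policy;
}
```

The point of the design is that an unencrypted directory never carries key material; the dummy policy lives with the filesystem, and the directory only answers whether it is itself encrypted.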
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Acked-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20200917041136.178600-13-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-09-17 07:11:35 +03:00
fscrypt_free_dummy_policy ( & sbi - > s_dummy_enc_policy ) ;
2021-05-21 10:55:33 +03:00
/* ext4_blkdev_remove() calls kill_bdev(), release bh before it. */
2006-10-11 12:20:50 +04:00
brelse ( bh ) ;
2021-05-21 10:55:33 +03:00
ext4_blkdev_remove ( sbi ) ;
2006-10-11 12:20:50 +04:00
out_fail :
sb - > s_fs_info = NULL ;
2012-11-09 00:16:54 +04:00
return err ? err : ret ;
2006-10-11 12:20:50 +04:00
}
2021-10-27 17:18:56 +03:00
static int ext4_fill_super ( struct super_block * sb , struct fs_context * fc )
2021-10-27 17:18:53 +03:00
{
2021-10-27 17:18:56 +03:00
struct ext4_fs_context * ctx = fc - > fs_private ;
2021-10-27 17:18:53 +03:00
struct ext4_sb_info * sbi ;
const char * descr ;
2021-10-27 17:18:56 +03:00
int ret ;
2021-10-27 17:18:53 +03:00
sbi = ext4_alloc_sbi ( sb ) ;
2021-10-27 17:18:56 +03:00
if ( ! sbi )
2022-01-19 16:02:09 +03:00
return - ENOMEM ;
2021-10-27 17:18:53 +03:00
2021-10-27 17:18:56 +03:00
fc - > s_fs_info = sbi ;
/* Cleanup superblock name */
strreplace ( sb - > s_id , ' / ' , ' ! ' ) ;
2021-10-27 17:18:53 +03:00
sbi - > s_sb_block = 1 ; /* Default super block location */
2021-10-27 17:18:56 +03:00
if ( ctx - > spec & EXT4_SPEC_s_sb_block )
sbi - > s_sb_block = ctx - > s_sb_block ;
2021-10-27 17:18:53 +03:00
2021-12-22 13:45:17 +03:00
ret = __ext4_fill_super ( fc , sb ) ;
2021-10-27 17:18:53 +03:00
if ( ret < 0 )
goto free_sbi ;
2021-10-27 17:18:56 +03:00
if ( sbi - > s_journal ) {
2021-10-27 17:18:53 +03:00
if ( test_opt ( sb , DATA_FLAGS ) = = EXT4_MOUNT_JOURNAL_DATA )
descr = " journalled data mode " ;
else if ( test_opt ( sb , DATA_FLAGS ) = = EXT4_MOUNT_ORDERED_DATA )
descr = " ordered data mode " ;
else
descr = " writeback data mode " ;
} else
descr = " out journal " ;
if ( ___ratelimit ( & ext4_mount_msg_ratelimit , " EXT4-fs mount " ) )
ext4_msg ( sb , KERN_INFO , " mounted filesystem with%s. "
2021-10-27 17:18:56 +03:00
" Quota mode: %s. " , descr , ext4_quota_mode ( sb ) ) ;
2022-04-15 05:39:00 +03:00
/* Update the s_overhead_clusters if necessary */
2022-06-29 07:00:26 +03:00
ext4_update_overhead ( sb , false ) ;
2021-10-27 17:18:53 +03:00
return 0 ;
2021-10-27 17:18:56 +03:00
2021-10-27 17:18:53 +03:00
free_sbi :
ext4_free_sbi ( sbi ) ;
2021-10-27 17:18:56 +03:00
fc - > s_fs_info = NULL ;
2021-10-27 17:18:53 +03:00
return ret ;
}
2021-10-27 17:18:56 +03:00
static int ext4_get_tree ( struct fs_context * fc )
{
return get_tree_bdev ( fc , ext4_fill_super ) ;
}
2006-10-11 12:20:50 +04:00
/*
* Set up any per-fs journal parameters now. We'll do this both on
* initial mount, once the journal has been initialised but before we've
* done any recovery; and again on any subsequent remount.
*/
2006-10-11 12:20:53 +04:00
static void ext4_init_journal_params ( struct super_block * sb , journal_t * journal )
2006-10-11 12:20:50 +04:00
{
2006-10-11 12:20:53 +04:00
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
2006-10-11 12:20:50 +04:00
2009-01-04 04:27:38 +03:00
journal - > j_commit_interval = sbi - > s_commit_interval ;
journal - > j_min_batch_time = sbi - > s_min_batch_time ;
journal - > j_max_batch_time = sbi - > s_max_batch_time ;
2020-10-15 23:37:55 +03:00
ext4_fc_init ( sb , journal ) ;
2006-10-11 12:20:50 +04:00
2010-08-04 05:35:12 +04:00
write_lock ( & journal - > j_state_lock ) ;
2006-10-11 12:20:50 +04:00
if ( test_opt ( sb , BARRIER ) )
2006-10-11 12:21:01 +04:00
journal - > j_flags | = JBD2_BARRIER ;
2006-10-11 12:20:50 +04:00
else
2006-10-11 12:21:01 +04:00
journal - > j_flags & = ~ JBD2_BARRIER ;
2008-10-11 06:12:43 +04:00
if ( test_opt ( sb , DATA_ERR_ABORT ) )
journal - > j_flags | = JBD2_ABORT_ON_SYNCDATA_ERR ;
else
journal - > j_flags & = ~ JBD2_ABORT_ON_SYNCDATA_ERR ;
2010-08-04 05:35:12 +04:00
write_unlock ( & journal - > j_state_lock ) ;
2006-10-11 12:20:50 +04:00
}
2016-09-30 09:05:09 +03:00
static struct inode * ext4_get_journal_inode ( struct super_block * sb ,
unsigned int journal_inum )
2006-10-11 12:20:50 +04:00
{
struct inode * journal_inode ;
2016-09-30 09:05:09 +03:00
/*
* Test for the existence of a valid inode on disk. Bad things
* happen if we iget() an unused inode, as the subsequent iput()
* will try to delete it.
*/
ext4: avoid declaring fs inconsistent due to invalid file handles
If we receive a file handle, either from NFS or open_by_handle_at(2),
and it points at an inode which has not been initialized, and the file
system has metadata checksums enabled, we shouldn't try to get the
inode, discover the checksum is invalid, and then declare the file
system as being inconsistent.
This can be reproduced by creating a test file system via "mke2fs -t
ext4 -O metadata_csum /tmp/foo.img 8M", mounting it, cd'ing into that
directory, and then running the following program.
	#define _GNU_SOURCE
	#include <fcntl.h>
	struct handle {
		struct file_handle fh;
		unsigned char fid[MAX_HANDLE_SZ];
	};
	int main(int argc, char **argv)
	{
		struct handle h = {{8, 1 }, { 12, }};
		open_by_handle_at(AT_FDCWD, &h.fh, O_RDONLY);
		return 0;
	}
Google-Bug-Id: 120690101
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2018-12-19 20:29:13 +03:00
journal_inode = ext4_iget ( sb , journal_inum , EXT4_IGET_SPECIAL ) ;
2008-02-07 11:15:37 +03:00
if ( IS_ERR ( journal_inode ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " no journal found " ) ;
2006-10-11 12:20:50 +04:00
return NULL ;
}
if ( ! journal_inode - > i_nlink ) {
make_bad_inode ( journal_inode ) ;
iput ( journal_inode ) ;
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " journal inode is deleted " ) ;
2006-10-11 12:20:50 +04:00
return NULL ;
}
2022-06-08 14:23:47 +03:00
ext4_debug ( " Journal inode found at %p: %lld bytes \n " ,
2006-10-11 12:20:50 +04:00
journal_inode , journal_inode - > i_size ) ;
2008-02-07 11:15:37 +03:00
if ( ! S_ISREG ( journal_inode - > i_mode ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " invalid journal inode " ) ;
2006-10-11 12:20:50 +04:00
iput ( journal_inode ) ;
return NULL ;
}
2016-09-30 09:05:09 +03:00
return journal_inode ;
}
static journal_t * ext4_get_journal ( struct super_block * sb ,
unsigned int journal_inum )
{
struct inode * journal_inode ;
journal_t * journal ;
2020-07-10 17:07:59 +03:00
if ( WARN_ON_ONCE ( ! ext4_has_feature_journal ( sb ) ) )
return NULL ;
2016-09-30 09:05:09 +03:00
journal_inode = ext4_get_journal_inode ( sb , journal_inum ) ;
if ( ! journal_inode )
return NULL ;
2006-10-11 12:20:50 +04:00
2006-10-11 12:21:01 +04:00
journal = jbd2_journal_init_inode ( journal_inode ) ;
2006-10-11 12:20:50 +04:00
if ( ! journal ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " Could not load journal inode " ) ;
2006-10-11 12:20:50 +04:00
iput ( journal_inode ) ;
return NULL ;
}
journal - > j_private = sb ;
2006-10-11 12:20:53 +04:00
ext4_init_journal_params ( sb , journal ) ;
2006-10-11 12:20:50 +04:00
return journal ;
}
2006-10-11 12:20:53 +04:00
static journal_t * ext4_get_dev_journal ( struct super_block * sb ,
2006-10-11 12:20:50 +04:00
dev_t j_dev )
{
2008-07-27 00:15:44 +04:00
struct buffer_head * bh ;
2006-10-11 12:20:50 +04:00
journal_t * journal ;
2006-10-11 12:20:53 +04:00
ext4_fsblk_t start ;
ext4_fsblk_t len ;
2006-10-11 12:20:50 +04:00
int hblock , blocksize ;
2006-10-11 12:20:53 +04:00
ext4_fsblk_t sb_block ;
2006-10-11 12:20:50 +04:00
unsigned long offset ;
2008-07-27 00:15:44 +04:00
struct ext4_super_block * es ;
2006-10-11 12:20:50 +04:00
struct block_device * bdev ;
2020-07-10 17:07:59 +03:00
if ( WARN_ON_ONCE ( ! ext4_has_feature_journal ( sb ) ) )
return NULL ;
2009-01-07 08:06:22 +03:00
2009-06-05 01:36:36 +04:00
bdev = ext4_blkdev_get ( j_dev , sb ) ;
2006-10-11 12:20:50 +04:00
if ( bdev = = NULL )
return NULL ;
blocksize = sb - > s_blocksize ;
2009-05-23 01:17:49 +04:00
hblock = bdev_logical_block_size ( bdev ) ;
2006-10-11 12:20:50 +04:00
if ( blocksize < hblock ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR ,
" blocksize too small for journal device " ) ;
2006-10-11 12:20:50 +04:00
goto out_bdev ;
}
2006-10-11 12:20:53 +04:00
sb_block = EXT4_MIN_BLOCK_SIZE / blocksize ;
offset = EXT4_MIN_BLOCK_SIZE % blocksize ;
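/*
 * Worked example (illustrative note, not part of the original source):
 * EXT4_MIN_BLOCK_SIZE is 1024, so the external journal's ext4
 * superblock starts 1024 bytes into the device.
 *   blocksize 1024: sb_block = 1024 / 1024 = 1, offset = 1024 % 1024 = 0
 *   blocksize 4096: sb_block = 1024 / 4096 = 0, offset = 1024 % 4096 = 1024
 * Either way, block sb_block read at the filesystem block size contains
 * the superblock at byte offset `offset` within the buffer.
 */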
2006-10-11 12:20:50 +04:00
set_blocksize ( bdev , blocksize ) ;
if ( ! ( bh = __bread ( bdev , sb_block , blocksize ) ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " couldn't read superblock of "
" external journal " ) ;
2006-10-11 12:20:50 +04:00
goto out_bdev ;
}
2012-05-29 01:47:52 +04:00
es = ( struct ext4_super_block * ) ( bh - > b_data + offset ) ;
2006-10-11 12:20:53 +04:00
if ( ( le16_to_cpu ( es - > s_magic ) ! = EXT4_SUPER_MAGIC ) | |
2006-10-11 12:20:50 +04:00
! ( le32_to_cpu ( es - > s_feature_incompat ) &
2006-10-11 12:20:53 +04:00
EXT4_FEATURE_INCOMPAT_JOURNAL_DEV ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " external journal has "
" bad superblock " ) ;
2006-10-11 12:20:50 +04:00
brelse ( bh ) ;
goto out_bdev ;
}
2014-09-11 19:44:36 +04:00
if ( ( le32_to_cpu ( es - > s_feature_ro_compat ) &
EXT4_FEATURE_RO_COMPAT_METADATA_CSUM ) & &
es - > s_checksum ! = ext4_superblock_csum ( sb , es ) ) {
ext4_msg ( sb , KERN_ERR , " external journal has "
" corrupt superblock " ) ;
brelse ( bh ) ;
goto out_bdev ;
}
2006-10-11 12:20:53 +04:00
if ( memcmp ( EXT4_SB ( sb ) - > s_es - > s_journal_uuid , es - > s_uuid , 16 ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " journal UUID does not match " ) ;
2006-10-11 12:20:50 +04:00
brelse ( bh ) ;
goto out_bdev ;
}
2006-10-11 12:21:10 +04:00
len = ext4_blocks_count ( es ) ;
2006-10-11 12:20:50 +04:00
start = sb_block + 1 ;
brelse ( bh ) ; /* we're done with the superblock */
2006-10-11 12:21:01 +04:00
journal = jbd2_journal_init_dev ( bdev , sb - > s_bdev ,
2006-10-11 12:20:50 +04:00
start , len , blocksize ) ;
if ( ! journal ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " failed to create device journal " ) ;
2006-10-11 12:20:50 +04:00
goto out_bdev ;
}
journal - > j_private = sb ;
2020-09-24 10:33:33 +03:00
if ( ext4_read_bh_lock ( journal - > j_sb_buffer , REQ_META | REQ_PRIO , true ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " I/O error on journal device " ) ;
2006-10-11 12:20:50 +04:00
goto out_journal ;
}
if ( be32_to_cpu ( journal - > j_superblock - > s_nr_users ) ! = 1 ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " External journal has more than one "
" user (unsupported) - %d " ,
2006-10-11 12:20:50 +04:00
be32_to_cpu ( journal - > j_superblock - > s_nr_users ) ) ;
goto out_journal ;
}
2020-09-24 06:03:42 +03:00
EXT4_SB ( sb ) - > s_journal_bdev = bdev ;
2006-10-11 12:20:53 +04:00
ext4_init_journal_params ( sb , journal ) ;
2006-10-11 12:20:50 +04:00
return journal ;
2009-06-04 01:59:28 +04:00
2006-10-11 12:20:50 +04:00
out_journal :
2006-10-11 12:21:01 +04:00
jbd2_journal_destroy ( journal ) ;
2006-10-11 12:20:50 +04:00
out_bdev :
2006-10-11 12:20:53 +04:00
ext4_blkdev_put ( bdev ) ;
2006-10-11 12:20:50 +04:00
return NULL ;
}
2006-10-11 12:20:53 +04:00
static int ext4_load_journal ( struct super_block * sb ,
struct ext4_super_block * es ,
2006-10-11 12:20:50 +04:00
unsigned long journal_devnum )
{
journal_t * journal ;
unsigned int journal_inum = le32_to_cpu ( es - > s_journal_inum ) ;
dev_t journal_dev ;
int err = 0 ;
int really_read_only ;
2020-07-17 12:06:05 +03:00
int journal_dev_ro ;
2006-10-11 12:20:50 +04:00
2020-07-10 17:07:59 +03:00
if ( WARN_ON_ONCE ( ! ext4_has_feature_journal ( sb ) ) )
return - EFSCORRUPTED ;
2009-01-07 08:06:22 +03:00
2006-10-11 12:20:50 +04:00
if ( journal_devnum & &
journal_devnum ! = le32_to_cpu ( es - > s_journal_dev ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_INFO , " external journal device major/minor "
" numbers have changed " ) ;
2006-10-11 12:20:50 +04:00
journal_dev = new_decode_dev ( journal_devnum ) ;
} else
journal_dev = new_decode_dev ( le32_to_cpu ( es - > s_journal_dev ) ) ;
2020-07-17 12:06:05 +03:00
if ( journal_inum & & journal_dev ) {
ext4_msg ( sb , KERN_ERR ,
" filesystem has both journal inode and journal device! " ) ;
return - EINVAL ;
}
if ( journal_inum ) {
journal = ext4_get_journal ( sb , journal_inum ) ;
if ( ! journal )
return - EINVAL ;
} else {
journal = ext4_get_dev_journal ( sb , journal_dev ) ;
if ( ! journal )
return - EINVAL ;
}
journal_dev_ro = bdev_read_only ( journal - > j_dev ) ;
really_read_only = bdev_read_only ( sb - > s_bdev ) | journal_dev_ro ;
if ( journal_dev_ro & & ! sb_rdonly ( sb ) ) {
ext4_msg ( sb , KERN_ERR ,
" journal device read-only, try mounting with '-o ro' " ) ;
err = - EROFS ;
goto err_out ;
}
2006-10-11 12:20:50 +04:00
/*
* Are we loading a blank journal or performing recovery after a
* crash? For recovery, we need to check in advance whether we
* can get read-write access to the device.
*/
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_journal_needs_recovery ( sb ) ) {
2017-07-17 10:45:34 +03:00
if ( sb_rdonly ( sb ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_INFO , " INFO: recovery "
" required on readonly filesystem " ) ;
2006-10-11 12:20:50 +04:00
if ( really_read_only ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " write access "
2017-10-18 20:06:37 +03:00
" unavailable, cannot proceed "
" (try mounting with noload) " ) ;
2020-07-17 12:06:05 +03:00
err = - EROFS ;
goto err_out ;
2006-10-11 12:20:50 +04:00
}
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_INFO , " write access will "
" be enabled during recovery " ) ;
2006-10-11 12:20:50 +04:00
}
}
2009-09-29 23:51:30 +04:00
if ( ! ( journal - > j_flags & JBD2_BARRIER ) )
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_INFO , " barriers disabled " ) ;
2008-09-09 07:00:52 +04:00
2015-10-17 23:18:43 +03:00
if ( ! ext4_has_feature_journal_needs_recovery ( sb ) )
2006-10-11 12:21:01 +04:00
err = jbd2_journal_wipe ( journal , ! really_read_only ) ;
2010-07-27 19:56:03 +04:00
if ( ! err ) {
char * save = kmalloc ( EXT4_S_ERR_LEN , GFP_KERNEL ) ;
if ( save )
memcpy ( save , ( ( char * ) es ) +
EXT4_S_ERR_START , EXT4_S_ERR_LEN ) ;
2006-10-11 12:21:01 +04:00
err = jbd2_journal_load ( journal ) ;
2010-07-27 19:56:03 +04:00
if ( save )
memcpy ( ( ( char * ) es ) + EXT4_S_ERR_START ,
save , EXT4_S_ERR_LEN ) ;
kfree ( save ) ;
}
2006-10-11 12:20:50 +04:00
if ( err ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " error loading journal " ) ;
2020-07-17 12:06:05 +03:00
goto err_out ;
2006-10-11 12:20:50 +04:00
}
2006-10-11 12:20:53 +04:00
EXT4_SB ( sb ) - > s_journal = journal ;
2020-07-10 17:07:59 +03:00
err = ext4_clear_journal_err ( sb , es ) ;
if ( err ) {
EXT4_SB ( sb ) - > s_journal = NULL ;
jbd2_journal_destroy ( journal ) ;
return err ;
}
2006-10-11 12:20:50 +04:00
2010-10-28 05:30:06 +04:00
if ( ! really_read_only & & journal_devnum & &
2006-10-11 12:20:50 +04:00
journal_devnum ! = le32_to_cpu ( es - > s_journal_dev ) ) {
es - > s_journal_dev = cpu_to_le32 ( journal_devnum ) ;
/* Make sure we flush the recovery flag to disk. */
2020-12-16 13:18:38 +03:00
ext4_commit_super ( sb ) ;
2006-10-11 12:20:50 +04:00
}
return 0 ;
2020-07-17 12:06:05 +03:00
err_out :
jbd2_journal_destroy ( journal ) ;
return err ;
2006-10-11 12:20:50 +04:00
}
2020-12-16 13:18:40 +03:00
/* Copy state of EXT4_SB(sb) into buffer for on-disk superblock */
static void ext4_update_super ( struct super_block * sb )
2006-10-11 12:20:50 +04:00
{
2020-11-27 14:34:00 +03:00
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
2020-12-16 13:18:41 +03:00
struct ext4_super_block * es = sbi - > s_es ;
struct buffer_head * sbh = sbi - > s_sbh ;
2018-07-03 01:45:18 +03:00
2020-12-16 13:18:39 +03:00
lock_buffer ( sbh ) ;
2009-09-11 01:31:04 +04:00
/*
* If the file system is mounted read-only, don't update the
* superblock write time. This avoids updating the superblock
* write time when we are mounting the root file system
* read/only but we need to replay the journal; at that point,
* for people who are east of GMT and who make their clock
* tick in localtime for Windows bug-for-bug compatibility,
* the clock is set in the future, and this will cause e2fsck
* to complain and force a full file system check.
*/
2017-11-28 00:05:09 +03:00
if ( ! ( sb - > s_flags & SB_RDONLY ) )
2018-07-29 22:51:48 +03:00
ext4_update_tstamp ( es , s_wtime ) ;
2020-11-24 11:36:54 +03:00
es - > s_kbytes_written =
2021-01-16 01:54:24 +03:00
cpu_to_le64 ( sbi - > s_kbytes_written +
2020-11-24 11:36:54 +03:00
( ( part_stat_read ( sb - > s_bdev , sectors [ STAT_WRITE ] ) -
2021-01-16 01:54:24 +03:00
sbi - > s_sectors_written_start ) > > 1 ) ) ;
2020-12-16 13:18:41 +03:00
if ( percpu_counter_initialized ( & sbi - > s_freeclusters_counter ) )
2014-07-15 14:01:38 +04:00
ext4_free_blocks_count_set ( es ,
2020-12-16 13:18:41 +03:00
EXT4_C2B ( sbi , percpu_counter_sum_positive (
& sbi - > s_freeclusters_counter ) ) ) ;
if ( percpu_counter_initialized ( & sbi - > s_freeinodes_counter ) )
2014-07-15 14:01:38 +04:00
es - > s_free_inodes_count =
cpu_to_le32 ( percpu_counter_sum_positive (
2020-12-16 13:18:41 +03:00
& sbi - > s_freeinodes_counter ) ) ;
2020-11-27 14:34:00 +03:00
/* Copy error information to the on-disk superblock */
spin_lock ( & sbi - > s_error_lock ) ;
if ( sbi - > s_add_error_count > 0 ) {
es - > s_state | = cpu_to_le16 ( EXT4_ERROR_FS ) ;
if ( ! es - > s_first_error_time & & ! es - > s_first_error_time_hi ) {
__ext4_update_tstamp ( & es - > s_first_error_time ,
& es - > s_first_error_time_hi ,
sbi - > s_first_error_time ) ;
strncpy ( es - > s_first_error_func , sbi - > s_first_error_func ,
sizeof ( es - > s_first_error_func ) ) ;
es - > s_first_error_line =
cpu_to_le32 ( sbi - > s_first_error_line ) ;
es - > s_first_error_ino =
cpu_to_le32 ( sbi - > s_first_error_ino ) ;
es - > s_first_error_block =
cpu_to_le64 ( sbi - > s_first_error_block ) ;
es - > s_first_error_errcode =
ext4_errno_to_code ( sbi - > s_first_error_code ) ;
}
__ext4_update_tstamp ( & es - > s_last_error_time ,
& es - > s_last_error_time_hi ,
sbi - > s_last_error_time ) ;
strncpy ( es - > s_last_error_func , sbi - > s_last_error_func ,
sizeof ( es - > s_last_error_func ) ) ;
es - > s_last_error_line = cpu_to_le32 ( sbi - > s_last_error_line ) ;
es - > s_last_error_ino = cpu_to_le32 ( sbi - > s_last_error_ino ) ;
es - > s_last_error_block = cpu_to_le64 ( sbi - > s_last_error_block ) ;
es - > s_last_error_errcode =
ext4_errno_to_code ( sbi - > s_last_error_code ) ;
/*
* Start the daily error reporting function if it hasn ' t been
* started already
*/
if ( ! es - > s_error_count )
mod_timer ( & sbi - > s_err_report , jiffies + 24 * 60 * 60 * HZ ) ;
le32_add_cpu ( & es - > s_error_count , sbi - > s_add_error_count ) ;
sbi - > s_add_error_count = 0 ;
}
spin_unlock ( & sbi - > s_error_lock ) ;
2012-10-10 09:06:58 +04:00
ext4_superblock_csum_set ( sb ) ;
2020-12-16 13:18:40 +03:00
unlock_buffer ( sbh ) ;
}
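The s_kbytes_written update in ext4_update_super() can be sketched in user space as follows. The sector delta since mount is shifted right by one, which converts 512-byte sectors to KiB, and is then added to the total carried over from earlier mounts. The helper name and its plain-integer parameters are assumptions made for illustration; the kernel reads the partition statistics directly and stores the result in little-endian form.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of ext4: lifetime KiB written is the
 * total stored from earlier mounts plus the sectors written since this
 * mount, converted from 512-byte sectors to KiB (two sectors per KiB).
 */
static uint64_t lifetime_kbytes_written(uint64_t stored_kbytes,
					uint64_t sectors_now,
					uint64_t sectors_at_mount)
{
	/* The write counter only grows while the device is in use. */
	assert(sectors_now >= sectors_at_mount);
	return stored_kbytes + ((sectors_now - sectors_at_mount) >> 1);
}
```

An odd trailing sector is dropped by the shift, so the stored value can lag the true figure by up to half a KiB per update.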
static int ext4_commit_super ( struct super_block * sb )
{
struct buffer_head * sbh = EXT4_SB ( sb ) - > s_sbh ;
2021-04-02 13:16:31 +03:00
if ( ! sbh )
return - EINVAL ;
if ( block_device_ejected ( sb ) )
return - ENODEV ;
2020-12-16 13:18:40 +03:00
ext4_update_super ( sb ) ;
2022-05-20 05:32:16 +03:00
lock_buffer ( sbh ) ;
/* Buffer got discarded which means block device got invalidated */
if ( ! buffer_mapped ( sbh ) ) {
unlock_buffer ( sbh ) ;
return - EIO ;
}
2018-12-31 07:20:39 +03:00
if ( buffer_write_io_error ( sbh ) | | ! buffer_uptodate ( sbh ) ) {
2016-07-04 17:24:52 +03:00
/*
* Oh , dear . A previous attempt to write the
* superblock failed . This could happen because the
* USB device was yanked out . Or it could happen to
* be a transient write error and maybe the block will
* be remapped . Nothing we can do but to retry the
* write and hope for the best .
*/
ext4_msg ( sb , KERN_ERR , " previous I/O error to "
" superblock detected " ) ;
clear_buffer_write_io_error ( sbh ) ;
set_buffer_uptodate ( sbh ) ;
}
2022-05-20 05:32:16 +03:00
get_bh ( sbh ) ;
/* Clear potential dirty bit if it was journalled update */
clear_buffer_dirty ( sbh ) ;
sbh - > b_end_io = end_buffer_write_sync ;
2022-07-14 21:07:13 +03:00
submit_bh ( REQ_OP_WRITE | REQ_SYNC |
( test_opt ( sb , BARRIER ) ? REQ_FUA : 0 ) , sbh ) ;
2022-05-20 05:32:16 +03:00
wait_on_buffer ( sbh ) ;
2020-12-16 13:18:38 +03:00
if ( buffer_write_io_error ( sbh ) ) {
ext4_msg ( sb , KERN_ERR , " I/O error while writing "
" superblock " ) ;
clear_buffer_write_io_error ( sbh ) ;
set_buffer_uptodate ( sbh ) ;
2022-05-20 05:32:16 +03:00
return - EIO ;
2008-10-07 05:35:40 +04:00
}
2022-05-20 05:32:16 +03:00
return 0 ;
2006-10-11 12:20:50 +04:00
}
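Two decisions in ext4_commit_super() lend themselves to a small model: resetting the buffer state after an earlier failed write so the write can be retried, and adding a forced-unit-access flag only when barriers are enabled. The sketch below is a simplified user-space model; the bit values and function names are hypothetical and do not match the kernel's buffer_head or REQ_* definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state bits standing in for buffer_head flags. */
#define BH_WRITE_EIO	0x1u	/* a previous write failed */
#define BH_UPTODATE	0x2u	/* buffer contents are valid */

/* Hypothetical request bits standing in for REQ_SYNC and REQ_FUA. */
#define WR_SYNC		0x1u
#define WR_FUA		0x2u

/*
 * If an earlier write failed, or the buffer lost its uptodate bit,
 * clear the error and mark the buffer valid again so that the next
 * submission is treated as a fresh attempt.
 */
static void prepare_retry(unsigned int *state)
{
	assert(state);
	if ((*state & BH_WRITE_EIO) || !(*state & BH_UPTODATE)) {
		*state &= ~BH_WRITE_EIO;
		*state |= BH_UPTODATE;
	}
}

/* Synchronous write; request FUA only when barriers are enabled. */
static unsigned int write_flags(bool barrier)
{
	return WR_SYNC | (barrier ? WR_FUA : 0u);
}
```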
/*
* Have we just finished recovery ? If so , and if we are mounting ( or
* remounting ) the filesystem readonly , then we will end up with a
* consistent fs on disk . Record that fact .
*/
2020-07-10 17:07:59 +03:00
static int ext4_mark_recovery_complete ( struct super_block * sb ,
struct ext4_super_block * es )
2006-10-11 12:20:50 +04:00
{
2020-07-10 17:07:59 +03:00
int err ;
2006-10-11 12:20:53 +04:00
journal_t * journal = EXT4_SB ( sb ) - > s_journal ;
2006-10-11 12:20:50 +04:00
2015-10-17 23:18:43 +03:00
if ( ! ext4_has_feature_journal ( sb ) ) {
2020-07-10 17:07:59 +03:00
if ( journal ! = NULL ) {
ext4_error ( sb , " Journal got removed while the fs was "
" mounted! " ) ;
return - EFSCORRUPTED ;
}
return 0 ;
2009-01-07 08:06:22 +03:00
}
2006-10-11 12:21:01 +04:00
jbd2_journal_lock_updates ( journal ) ;
2021-05-18 18:13:25 +03:00
err = jbd2_journal_flush ( journal , 0 ) ;
2020-07-10 17:07:59 +03:00
if ( err < 0 )
2008-10-11 04:29:21 +04:00
goto out ;
2021-08-16 12:57:06 +03:00
if ( sb_rdonly ( sb ) & & ( ext4_has_feature_journal_needs_recovery ( sb ) | |
ext4_has_feature_orphan_present ( sb ) ) ) {
if ( ! ext4_orphan_file_empty ( sb ) ) {
ext4_error ( sb , " Orphan file not empty on read-only fs. " ) ;
err = - EFSCORRUPTED ;
goto out ;
}
2015-10-17 23:18:43 +03:00
ext4_clear_feature_journal_needs_recovery ( sb ) ;
2021-08-16 12:57:06 +03:00
ext4_clear_feature_orphan_present ( sb ) ;
2020-12-16 13:18:38 +03:00
ext4_commit_super ( sb ) ;
2006-10-11 12:20:50 +04:00
}
2008-10-11 04:29:21 +04:00
out :
2006-10-11 12:21:01 +04:00
jbd2_journal_unlock_updates ( journal ) ;
2020-07-10 17:07:59 +03:00
return err ;
2006-10-11 12:20:50 +04:00
}
/*
* If we are mounting ( or read - write remounting ) a filesystem whose journal
* has recorded an error from a previous lifetime , move that error to the
* main filesystem now .
*/
2020-07-10 17:07:59 +03:00
static int ext4_clear_journal_err ( struct super_block * sb ,
2008-07-27 00:15:44 +04:00
struct ext4_super_block * es )
2006-10-11 12:20:50 +04:00
{
journal_t * journal ;
int j_errno ;
const char * errstr ;
2020-07-10 17:07:59 +03:00
if ( ! ext4_has_feature_journal ( sb ) ) {
ext4_error ( sb , " Journal got removed while the fs was mounted! " ) ;
return - EFSCORRUPTED ;
}
2009-01-07 08:06:22 +03:00
2006-10-11 12:20:53 +04:00
journal = EXT4_SB ( sb ) - > s_journal ;
2006-10-11 12:20:50 +04:00
/*
* Now check for any error status which may have been recorded in the
2006-10-11 12:20:53 +04:00
* journal by a prior ext4_error ( ) or ext4_abort ( )
2006-10-11 12:20:50 +04:00
*/
2006-10-11 12:21:01 +04:00
j_errno = jbd2_journal_errno ( journal ) ;
2006-10-11 12:20:50 +04:00
if ( j_errno ) {
char nbuf [ 16 ] ;
2006-10-11 12:20:53 +04:00
errstr = ext4_decode_error ( sb , j_errno , nbuf ) ;
2010-02-15 22:19:27 +03:00
ext4_warning ( sb , " Filesystem error recorded "
2006-10-11 12:20:50 +04:00
" from previous mount: %s " , errstr ) ;
2010-02-15 22:19:27 +03:00
ext4_warning ( sb , " Marking fs in need of filesystem check. " ) ;
2006-10-11 12:20:50 +04:00
2006-10-11 12:20:53 +04:00
EXT4_SB ( sb ) - > s_mount_state | = EXT4_ERROR_FS ;
es - > s_state | = cpu_to_le16 ( EXT4_ERROR_FS ) ;
2020-12-16 13:18:38 +03:00
ext4_commit_super ( sb ) ;
2006-10-11 12:20:50 +04:00
2006-10-11 12:21:01 +04:00
jbd2_journal_clear_err ( journal ) ;
2012-08-06 03:04:57 +04:00
jbd2_journal_update_sb_errno ( journal ) ;
2006-10-11 12:20:50 +04:00
}
2020-07-10 17:07:59 +03:00
return 0 ;
2006-10-11 12:20:50 +04:00
}
/*
* Force the running and committing transactions to commit ,
* and wait on the commit .
*/
2006-10-11 12:20:53 +04:00
int ext4_force_commit ( struct super_block * sb )
2006-10-11 12:20:50 +04:00
{
journal_t * journal ;
2017-07-17 10:45:34 +03:00
if ( sb_rdonly ( sb ) )
2006-10-11 12:20:50 +04:00
return 0 ;
2006-10-11 12:20:53 +04:00
journal = EXT4_SB ( sb ) - > s_journal ;
2013-01-29 06:41:02 +04:00
return ext4_journal_force_commit ( journal ) ;
2006-10-11 12:20:50 +04:00
}
2006-10-11 12:20:53 +04:00
static int ext4_sync_fs ( struct super_block * sb , int wait )
2006-10-11 12:20:50 +04:00
{
2008-11-04 02:10:55 +03:00
int ret = 0 ;
2009-02-10 14:46:05 +03:00
tid_t target ;
2013-06-13 06:25:07 +04:00
bool needs_barrier = false ;
2009-09-28 23:48:29 +04:00
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
2006-10-11 12:20:50 +04:00
2018-01-11 21:17:49 +03:00
if ( unlikely ( ext4_forced_shutdown ( sbi ) ) )
2017-02-05 09:28:48 +03:00
return 0 ;
2009-06-17 19:48:11 +04:00
trace_ext4_sync_fs ( sb , wait ) ;
2013-06-04 22:21:02 +04:00
flush_workqueue ( sbi - > rsv_conversion_wq ) ;
2012-07-03 18:45:29 +04:00
/*
* Writeback quota in non - journalled quota case - journalled quota has
* no dirty dquots
*/
dquot_writeback_dquots ( sb , - 1 ) ;
2013-06-13 06:25:07 +04:00
/*
* Data writeback is possible w/o journal transaction, so a barrier must
* be sent at the end of the function. But we can skip it if
* transaction_commit will do it for us.
*/
2014-09-19 00:12:37 +04:00
if ( sbi - > s_journal ) {
target = jbd2_get_latest_transaction ( sbi - > s_journal ) ;
if ( wait & & sbi - > s_journal - > j_flags & JBD2_BARRIER & &
! jbd2_trans_will_send_data_barrier ( sbi - > s_journal , target ) )
needs_barrier = true ;
if ( jbd2_journal_start_commit ( sbi - > s_journal , & target ) ) {
if ( wait )
ret = jbd2_log_wait_commit ( sbi - > s_journal ,
target ) ;
}
} else if ( wait & & test_opt ( sb , BARRIER ) )
2013-06-13 06:25:07 +04:00
needs_barrier = true ;
if ( needs_barrier ) {
int err ;
2021-01-26 17:52:35 +03:00
err = blkdev_issue_flush ( sb - > s_bdev ) ;
2013-06-13 06:25:07 +04:00
if ( ! ret )
ret = err ;
2009-01-07 08:06:22 +03:00
}
2013-06-13 06:25:07 +04:00
return ret ;
}
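The barrier decision in ext4_sync_fs() reduces to a small predicate, sketched here in user space. The struct and its field names are hypothetical stand-ins for the journal flags, the jbd2_trans_will_send_data_barrier() result, and the barrier mount option; only the branching mirrors the function above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the state ext4_sync_fs() consults. */
struct sync_state {
	bool has_journal;		/* a journal is attached */
	bool journal_barrier;		/* journal has barriers enabled */
	bool commit_sends_barrier;	/* the commit will flush for us */
	bool mount_barrier;		/* "barrier" mount option is set */
};

/*
 * Returns true when the caller must issue its own cache flush:
 * only for waiting syncs, and only when no journal commit is going
 * to send the data barrier anyway.
 */
static bool needs_explicit_flush(const struct sync_state *s, bool wait)
{
	assert(s);
	if (!wait)
		return false;
	if (s->has_journal)
		return s->journal_barrier && !s->commit_sends_barrier;
	return s->mount_barrier;
}
```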
2006-10-11 12:20:50 +04:00
/*
* LVM calls this function before a ( read - only ) snapshot is created . This
* gives us a chance to flush the journal completely and mark the fs clean .
2011-04-11 06:06:07 +04:00
*
* Note that this function alone cannot bring a filesystem into a clean
2012-06-12 18:20:38 +04:00
* state. It relies on the upper layer to stop all data and metadata
* modifications.
2006-10-11 12:20:50 +04:00
*/
2009-01-10 03:40:58 +03:00
static int ext4_freeze ( struct super_block * sb )
2006-10-11 12:20:50 +04:00
{
2009-01-10 03:40:58 +03:00
int error = 0 ;
journal_t * journal ;
2006-10-11 12:20:50 +04:00
2017-07-17 10:45:34 +03:00
if ( sb_rdonly ( sb ) )
2009-05-01 20:52:25 +04:00
return 0 ;
2006-10-11 12:20:50 +04:00
2009-05-01 20:52:25 +04:00
journal = EXT4_SB ( sb ) - > s_journal ;
2008-10-11 04:29:21 +04:00
2014-09-19 01:12:02 +04:00
if ( journal ) {
/* Now we set up the journal barrier. */
jbd2_journal_lock_updates ( journal ) ;
2006-10-11 12:20:50 +04:00
2014-09-19 01:12:02 +04:00
/*
* Don ' t clear the needs_recovery flag if we failed to
* flush the journal .
*/
2021-05-18 18:13:25 +03:00
error = jbd2_journal_flush ( journal , 0 ) ;
2014-09-19 01:12:02 +04:00
if ( error < 0 )
goto out ;
2015-08-15 17:45:06 +03:00
/* Journal blocked and flushed, clear needs_recovery flag. */
2015-10-17 23:18:43 +03:00
ext4_clear_feature_journal_needs_recovery ( sb ) ;
2021-08-16 12:57:06 +03:00
if ( ext4_orphan_file_empty ( sb ) )
ext4_clear_feature_orphan_present ( sb ) ;
2014-09-19 01:12:02 +04:00
}
2009-05-01 20:52:25 +04:00
2020-12-16 13:18:38 +03:00
error = ext4_commit_super ( sb ) ;
2010-05-16 10:00:00 +04:00
out :
2014-09-19 01:12:02 +04:00
if ( journal )
/* we rely on upper layer to stop further updates */
jbd2_journal_unlock_updates ( journal ) ;
2010-05-16 10:00:00 +04:00
return error ;
2006-10-11 12:20:50 +04:00
}
/*
* Called by LVM after the snapshot is done . We need to reset the RECOVER
* flag here , even though the filesystem is not technically dirty yet .
*/
2009-01-10 03:40:58 +03:00
static int ext4_unfreeze ( struct super_block * sb )
2006-10-11 12:20:50 +04:00
{
2017-07-17 10:45:34 +03:00
if ( sb_rdonly ( sb ) | | ext4_forced_shutdown ( EXT4_SB ( sb ) ) )
2009-05-01 20:52:25 +04:00
return 0 ;
2015-08-15 17:45:06 +03:00
if ( EXT4_SB ( sb ) - > s_journal ) {
/* Reset the needs_recovery flag before the fs is unlocked. */
2015-10-17 23:18:43 +03:00
ext4_set_feature_journal_needs_recovery ( sb ) ;
2021-08-16 12:57:06 +03:00
if ( ext4_has_feature_orphan_file ( sb ) )
ext4_set_feature_orphan_present ( sb ) ;
2015-08-15 17:45:06 +03:00
}
2020-12-16 13:18:38 +03:00
ext4_commit_super ( sb ) ;
2009-01-10 03:40:58 +03:00
return 0 ;
2006-10-11 12:20:50 +04:00
}
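The way ext4_freeze() and ext4_unfreeze() toggle the needs_recovery flag can be modelled in a few lines. This is a simplified sketch: the struct, the feature bit value, and the flush_result parameter (standing in for the jbd2_journal_flush() return value) are assumptions, and the orphan_present handling and the superblock write are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FEAT_NEEDS_RECOVERY	0x1u	/* hypothetical bit value */

/* Hypothetical, reduced view of the filesystem state. */
struct fs_model {
	uint32_t incompat;	/* feature flags */
	bool has_journal;
	bool read_only;
};

/* Clear needs_recovery only if the journal flush succeeded. */
static int model_freeze(struct fs_model *fs, int flush_result)
{
	assert(fs);
	if (fs->read_only)
		return 0;
	if (fs->has_journal) {
		if (flush_result < 0)
			return flush_result;	/* flag stays set */
		fs->incompat &= ~FEAT_NEEDS_RECOVERY;
	}
	return 0;
}

/* Set the flag again before updates resume. */
static void model_unfreeze(struct fs_model *fs)
{
	assert(fs);
	if (fs->read_only)
		return;
	if (fs->has_journal)
		fs->incompat |= FEAT_NEEDS_RECOVERY;
}
```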
2010-12-16 04:28:48 +03:00
/*
* Structure to save mount options for ext4_remount ' s benefit
*/
struct ext4_mount_options {
unsigned long s_mount_opt ;
2010-12-16 04:30:48 +03:00
unsigned long s_mount_opt2 ;
2012-02-08 03:41:49 +04:00
kuid_t s_resuid ;
kgid_t s_resgid ;
2010-12-16 04:28:48 +03:00
unsigned long s_commit_interval ;
u32 s_min_batch_time , s_max_batch_time ;
# ifdef CONFIG_QUOTA
int s_jquota_fmt ;
2014-09-11 19:15:15 +04:00
char * s_qf_names [ EXT4_MAXQUOTAS ] ;
2010-12-16 04:28:48 +03:00
# endif
} ;
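The purpose of struct ext4_mount_options is a snapshot-and-rollback pattern: remount copies the live options aside, applies the new ones, and restores the copy if validation fails. A minimal user-space sketch of that pattern follows; the struct, the two option bits, and apply_opts() are hypothetical, and the single conflict check stands in for the several checks performed by the real remount path.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical, reduced option set. */
struct mount_opts {
	unsigned long flags;
	unsigned long commit_interval;
};

#define OPT_DATA_JOURNAL	0x1ul	/* hypothetical bit values */
#define OPT_DELALLOC		0x2ul

/*
 * Apply the wanted options to the live set. On a conflicting
 * combination, restore the snapshot and report -EINVAL so the
 * caller sees the options exactly as they were before the call.
 */
static int apply_opts(struct mount_opts *live, const struct mount_opts *wanted)
{
	struct mount_opts old;

	assert(live && wanted);
	old = *live;		/* store the original options */
	*live = *wanted;

	if ((live->flags & OPT_DATA_JOURNAL) && (live->flags & OPT_DELALLOC)) {
		*live = old;	/* roll back */
		return -EINVAL;
	}
	return 0;
}
```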
2021-12-22 13:45:17 +03:00
static int __ext4_remount ( struct fs_context * fc , struct super_block * sb )
2006-10-11 12:20:50 +04:00
{
2021-10-27 17:18:53 +03:00
struct ext4_fs_context * ctx = fc - > fs_private ;
2008-07-27 00:15:44 +04:00
struct ext4_super_block * es ;
2006-10-11 12:20:53 +04:00
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
2021-12-22 13:45:17 +03:00
unsigned long old_sb_flags ;
2006-10-11 12:20:53 +04:00
struct ext4_mount_options old_opts ;
2008-07-26 22:34:21 +04:00
ext4_group_t g ;
2011-05-25 02:31:25 +04:00
int err = 0 ;
2006-10-11 12:20:50 +04:00
# ifdef CONFIG_QUOTA
2021-08-24 06:49:29 +03:00
int enable_quota = 0 ;
2013-01-25 08:24:58 +04:00
int i , j ;
2018-10-12 16:28:09 +03:00
char * to_free [ EXT4_MAXQUOTAS ] ;
2006-10-11 12:20:50 +04:00
# endif
2018-07-29 22:51:54 +03:00
2006-10-11 12:20:50 +04:00
/* Store the original options */
old_sb_flags = sb - > s_flags ;
old_opts . s_mount_opt = sbi - > s_mount_opt ;
2010-12-16 04:30:48 +03:00
old_opts . s_mount_opt2 = sbi - > s_mount_opt2 ;
2006-10-11 12:20:50 +04:00
old_opts . s_resuid = sbi - > s_resuid ;
old_opts . s_resgid = sbi - > s_resgid ;
old_opts . s_commit_interval = sbi - > s_commit_interval ;
2009-01-04 04:27:38 +03:00
old_opts . s_min_batch_time = sbi - > s_min_batch_time ;
old_opts . s_max_batch_time = sbi - > s_max_batch_time ;
2006-10-11 12:20:50 +04:00
# ifdef CONFIG_QUOTA
old_opts . s_jquota_fmt = sbi - > s_jquota_fmt ;
2014-09-11 19:15:15 +04:00
for ( i = 0 ; i < EXT4_MAXQUOTAS ; i + + )
2013-01-25 08:24:58 +04:00
if ( sbi - > s_qf_names [ i ] ) {
2018-10-12 16:28:09 +03:00
char * qf_name = get_qf_name ( sb , sbi , i ) ;
old_opts . s_qf_names [ i ] = kstrdup ( qf_name , GFP_KERNEL ) ;
2013-01-25 08:24:58 +04:00
if ( ! old_opts . s_qf_names [ i ] ) {
for ( j = 0 ; j < i ; j + + )
kfree ( old_opts . s_qf_names [ j ] ) ;
return - ENOMEM ;
}
} else
old_opts . s_qf_names [ i ] = NULL ;
2006-10-11 12:20:50 +04:00
# endif
2022-04-18 11:35:45 +03:00
if ( ! ( ctx - > spec & EXT4_SPEC_JOURNAL_IOPRIO ) ) {
if ( sbi - > s_journal & & sbi - > s_journal - > j_task - > io_context )
ctx - > journal_ioprio =
sbi - > s_journal - > j_task - > io_context - > ioprio ;
else
ctx - > journal_ioprio = DEFAULT_JOURNAL_IOPRIO ;
}
2006-10-11 12:20:50 +04:00
2021-10-27 17:18:53 +03:00
ext4_apply_options ( fc , sb ) ;
2006-10-11 12:20:50 +04:00
2014-10-30 17:53:16 +03:00
if ( ( old_opts . s_mount_opt & EXT4_MOUNT_JOURNAL_CHECKSUM ) ^
2014-11-26 00:20:50 +03:00
test_opt ( sb , JOURNAL_CHECKSUM ) ) {
ext4_msg ( sb , KERN_ERR , " changing journal_checksum "
2015-02-13 07:07:37 +03:00
" during remount not supported; ignoring " ) ;
sbi - > s_mount_opt ^ = EXT4_MOUNT_JOURNAL_CHECKSUM ;
2014-10-30 17:53:16 +03:00
}
2013-08-09 07:02:24 +04:00
if ( test_opt ( sb , DATA_FLAGS ) = = EXT4_MOUNT_JOURNAL_DATA ) {
if ( test_opt2 ( sb , EXPLICIT_DELALLOC ) ) {
ext4_msg ( sb , KERN_ERR , " can't mount with "
" both data=journal and delalloc " ) ;
err = - EINVAL ;
goto restore_opts ;
}
if ( test_opt ( sb , DIOREAD_NOLOCK ) ) {
ext4_msg ( sb , KERN_ERR , " can't mount with "
" both data=journal and dioread_nolock " ) ;
err = - EINVAL ;
goto restore_opts ;
}
2016-12-04 00:20:53 +03:00
} else if ( test_opt ( sb , DATA_FLAGS ) = = EXT4_MOUNT_ORDERED_DATA ) {
if ( test_opt ( sb , JOURNAL_ASYNC_COMMIT ) ) {
ext4_msg ( sb , KERN_ERR , " can't mount with "
" journal_async_commit in data=ordered mode " ) ;
err = - EINVAL ;
goto restore_opts ;
}
2015-02-17 02:59:38 +03:00
}
2017-06-22 18:55:14 +03:00
if ( ( sbi - > s_mount_opt ^ old_opts . s_mount_opt ) & EXT4_MOUNT_NO_MBCACHE ) {
ext4_msg ( sb , KERN_ERR , " can't enable nombcache during remount " ) ;
err = - EINVAL ;
goto restore_opts ;
}
2020-11-06 06:59:09 +03:00
if ( ext4_test_mount_flag ( sb , EXT4_MF_FS_ABORTED ) )
2021-10-26 20:33:02 +03:00
ext4_abort ( sb , ESHUTDOWN , " Abort forced by user " ) ;
2006-10-11 12:20:50 +04:00
2017-11-28 00:05:09 +03:00
sb - > s_flags = ( sb - > s_flags & ~ SB_POSIXACL ) |
( test_opt ( sb , POSIX_ACL ) ? SB_POSIXACL : 0 ) ;
2006-10-11 12:20:50 +04:00
es = sbi - > s_es ;
2009-01-06 06:46:26 +03:00
if ( sbi - > s_journal ) {
2009-01-07 08:06:22 +03:00
ext4_init_journal_params ( sb , sbi - > s_journal ) ;
2021-10-27 17:18:53 +03:00
set_task_ioprio ( sbi - > s_journal - > j_task , ctx - > journal_ioprio ) ;
2009-01-06 06:46:26 +03:00
}
2006-10-11 12:20:50 +04:00
2020-11-27 14:34:00 +03:00
/* Flush outstanding errors before changing fs state */
flush_work ( & sbi - > s_error_work ) ;
2021-12-22 13:45:17 +03:00
if ( ( bool ) ( fc - > sb_flags & SB_RDONLY ) ! = sb_rdonly ( sb ) ) {
2020-11-06 06:59:09 +03:00
if ( ext4_test_mount_flag ( sb , EXT4_MF_FS_ABORTED ) ) {
2006-10-11 12:20:50 +04:00
err = - EROFS ;
goto restore_opts ;
}
2021-12-22 13:45:17 +03:00
if ( fc - > sb_flags & SB_RDONLY ) {
2014-03-14 06:49:42 +04:00
err = sync_filesystem ( sb ) ;
if ( err < 0 )
goto restore_opts ;
2010-05-19 15:16:41 +04:00
err = dquot_suspend ( sb , - 1 ) ;
if ( err < 0 )
2010-05-19 15:16:40 +04:00
goto restore_opts ;
2006-10-11 12:20:50 +04:00
/*
* First of all , the unconditional stuff we have to do
* to disable replay of the journal when we next remount
*/
2017-11-28 00:05:09 +03:00
sb - > s_flags | = SB_RDONLY ;
2006-10-11 12:20:50 +04:00
/*
* OK , test if we are remounting a valid rw partition
* readonly , and if so set the rdonly flag and then
* mark the partition as valid again .
*/
2006-10-11 12:20:53 +04:00
if ( ! ( es - > s_state & cpu_to_le16 ( EXT4_VALID_FS ) ) & &
( sbi - > s_mount_state & EXT4_VALID_FS ) )
2006-10-11 12:20:50 +04:00
es - > s_state = cpu_to_le16 ( sbi - > s_mount_state ) ;
2020-07-10 17:07:59 +03:00
if ( sbi - > s_journal ) {
/*
* We let remount - ro finish even if marking fs
* as clean failed . . .
*/
2009-01-07 08:06:22 +03:00
ext4_mark_recovery_complete ( sb , es ) ;
2020-07-10 17:07:59 +03:00
}
2006-10-11 12:20:50 +04:00
} else {
2009-08-18 08:20:23 +04:00
/* Make sure we can mount this feature set readwrite */
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_readonly ( sb ) | |
2015-02-13 06:31:21 +03:00
! ext4_feature_set_ok ( sb , 0 ) ) {
2006-10-11 12:20:50 +04:00
err = - EROFS ;
goto restore_opts ;
}
2008-07-26 22:34:21 +04:00
/*
* Make sure the group descriptor checksums
2009-06-04 01:59:28 +04:00
* are sane . If they aren ' t , refuse to remount r / w .
2008-07-26 22:34:21 +04:00
*/
for ( g = 0 ; g < sbi - > s_groups_count ; g + + ) {
struct ext4_group_desc * gdp =
ext4_get_group_desc ( sb , g , NULL ) ;
2012-04-30 02:45:10 +04:00
if ( ! ext4_group_desc_csum_verify ( sb , g , gdp ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR ,
" ext4_remount: Checksum for group %u failed (%u!=%u) " ,
2015-10-17 23:18:43 +03:00
g , le16_to_cpu ( ext4_group_desc_csum ( sb , g , gdp ) ) ,
2008-07-26 22:34:21 +04:00
le16_to_cpu ( gdp - > bg_checksum ) ) ;
2015-10-17 23:16:04 +03:00
err = - EFSBADCRC ;
2008-07-26 22:34:21 +04:00
goto restore_opts ;
}
}
2007-02-10 12:46:08 +03:00
/*
* If we have an unprocessed orphan list hanging
* around from a previously readonly bdev mount ,
* require a full umount / remount for now .
*/
2021-08-16 12:57:06 +03:00
if ( es - > s_last_orphan | | ! ext4_orphan_file_empty ( sb ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_WARNING , " Couldn't "
2007-02-10 12:46:08 +03:00
" remount RDWR because of unprocessed "
" orphan inode list. Please "
2009-06-05 01:36:36 +04:00
" umount/remount instead " ) ;
2007-02-10 12:46:08 +03:00
err = - EINVAL ;
goto restore_opts ;
}
2006-10-11 12:20:50 +04:00
/*
* Mounting a RDONLY partition read - write , so reread
* and store the current valid flag . ( It may have
* been changed by e2fsck since we originally mounted
* the partition . )
*/
2020-07-10 17:07:59 +03:00
if ( sbi - > s_journal ) {
err = ext4_clear_journal_err ( sb , es ) ;
if ( err )
goto restore_opts ;
}
2022-05-17 20:27:55 +03:00
sbi - > s_mount_state = ( le16_to_cpu ( es - > s_state ) &
~ EXT4_FC_REPLAY ) ;
2018-05-14 06:02:19 +03:00
err = ext4_setup_super ( sb , es , 0 ) ;
if ( err )
goto restore_opts ;
sb - > s_flags & = ~ SB_RDONLY ;
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_mmp ( sb ) )
2011-05-25 02:31:25 +04:00
if ( ext4_multi_mount_protect ( sb ,
le64_to_cpu ( es - > s_mmp_block ) ) ) {
err = - EROFS ;
goto restore_opts ;
}
2021-08-24 06:49:29 +03:00
# ifdef CONFIG_QUOTA
2010-05-19 15:16:40 +04:00
enable_quota = 1 ;
2021-08-24 06:49:29 +03:00
# endif
2006-10-11 12:20:50 +04:00
}
}
ext4: add support for lazy inode table initialization
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out. The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, have not been corrupted. However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the old, uninitialized data could potentially cause e2fsck to
report false problems.
Hence, it is important for the inode tables to be initialized as soon
as possible. This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.
This is done via a new kernel thread called ext4lazyinit, which is
created on demand and destroyed when it is no longer needed. There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with the init_itable mount option is mounted, the
ext4lazyinit thread is created, then the filesystem can register its
request in the request list.
This thread then walks through the list of requests, picking up
scheduled requests and invoking ext4_init_inode_table(). The next schedule
time for the request is computed by multiplying the time it took to
zero out the last inode table by the wait multiplier, which can be set
with the (init_itable=n) mount option (default is 10). We do
this so that we do not take the whole I/O bandwidth. When the thread is no
longer necessary (request list is empty) it frees the appropriate
structures and exits (and can be created later by another
filesystem).
We do not disturb regular inode allocations in any way; they simply do not
care whether the inode table is zeroed or not. But when zeroing, we
have to skip used inodes, obviously. We should also prevent new inode
allocations from the group while zeroing is under way. For that we
take the write alloc_sem lock in ext4_init_inode_table() and the read
alloc_sem lock in ext4_claim_inode(), so when we are unlucky and the
allocator hits the group which is currently being zeroed, it just has to wait.
This can be suppressed using the mount option no_init_itable.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
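The scheduling rule described in the commit message above can be sketched as follows. The function name, the tick unit, and the choice to add the delay to the current time are assumptions for illustration; the message only states that the delay is the last zeroing time multiplied by the wait multiplier, with a default of 10.

```c
#include <assert.h>
#include <stdint.h>

#define DEFAULT_WAIT_MULT	10u	/* default per the commit message */

/*
 * Hypothetical helper: the next run is scheduled after a delay equal
 * to the time the last inode table took to zero, scaled by the wait
 * multiplier, so that zeroing uses only a fraction of the I/O
 * bandwidth. All times are in the same arbitrary tick unit.
 */
static uint64_t next_schedule(uint64_t now, uint64_t zeroing_ticks,
			      unsigned int wait_mult)
{
	assert(wait_mult > 0);
	return now + zeroing_ticks * wait_mult;
}
```

With the default multiplier, a table that took 5 ticks to zero pushes the next request 50 ticks out, leaving roughly ten idle ticks for every tick spent zeroing.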
2010-10-28 05:30:05 +04:00
/*
* Reinitialize lazy itable initialization thread based on
* current settings
*/
2017-07-17 10:45:34 +03:00
if ( sb_rdonly ( sb ) | | ! test_opt ( sb , INIT_INODE_TABLE ) )
2010-10-28 05:30:05 +04:00
ext4_unregister_li_request ( sb ) ;
else {
ext4_group_t first_not_zeroed ;
first_not_zeroed = ext4_has_uninit_itable ( sb ) ;
ext4_register_li_request ( sb , first_not_zeroed ) ;
}
2020-07-28 16:04:37 +03:00
/*
* Handle creation of system zone data early because it can fail .
* Releasing of existing data is done when we are sure remount will
* succeed .
*/
2020-09-24 06:03:43 +03:00
if ( test_opt ( sb , BLOCK_VALIDITY ) & & ! sbi - > s_system_blks ) {
2020-07-28 16:04:37 +03:00
err = ext4_setup_system_zone ( sb ) ;
if ( err )
goto restore_opts ;
}
2020-07-28 16:04:32 +03:00
2018-05-14 06:02:19 +03:00
if ( sbi - > s_journal = = NULL & & ! ( old_sb_flags & SB_RDONLY ) ) {
2020-12-16 13:18:38 +03:00
err = ext4_commit_super ( sb ) ;
2018-05-14 06:02:19 +03:00
if ( err )
goto restore_opts ;
}
2009-01-07 08:06:22 +03:00
2006-10-11 12:20:50 +04:00
# ifdef CONFIG_QUOTA
/* Release old quota file names */
2014-09-11 19:15:15 +04:00
for ( i = 0 ; i < EXT4_MAXQUOTAS ; i + + )
2013-01-25 08:24:58 +04:00
kfree ( old_opts . s_qf_names [ i ] ) ;
ext4: make quota as first class supported feature
This patch adds support for quotas as a first class feature in ext4;
which is to say, the quota files are stored in hidden inodes as file
system metadata, instead of as separate files visible in the file system
directory hierarchy.
It is based on the proposal at:
https://ext4.wiki.kernel.org/index.php/Design_For_1st_Class_Quota_in_Ext4
This patch introduces a new feature - EXT4_FEATURE_RO_COMPAT_QUOTA
which, when turned on, enables quota accounting at mount time
itself. Also, the quota inodes are stored in two additional superblock
fields. Some changes introduced by this patch that should be pointed
out are:
1) Two new ext4-superblock fields - s_usr_quota_inum and
s_grp_quota_inum for storing the quota inodes in use.
2) Default quota inodes are: inode#3 for tracking userquota and inode#4
for tracking group quota. The superblock fields can be set to use
other inodes as well.
3) If the QUOTA feature and corresponding quota inodes are set in
superblock, the quota usage tracking is turned on at mount time. On
'quotaon' ioctl, the quota limits enforcement is turned
on. 'quotaoff' ioctl turns off only the limits enforcement in this
case.
4) When QUOTA feature is in use, the quota mount options 'quota',
'usrquota', 'grpquota' are ignored by the kernel.
5) mke2fs or tune2fs can be used to set the QUOTA feature and initialize
quota inodes. The default reserved inodes will not be visible to user
as regular files.
6) The quota-tools will need to be modified to support hidden quota
files on ext4. E2fsprogs will also include support for creating and
fixing quota files.
7) Support is only for the new V2 quota file format.
Tested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johann Lombardi <johann@whamcloud.com>
Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-07-23 04:21:31 +04:00
if ( enable_quota ) {
if ( sb_any_quota_suspended ( sb ) )
dquot_resume ( sb , - 1 ) ;
2015-10-17 23:18:43 +03:00
else if ( ext4_has_feature_quota ( sb ) ) {
2012-07-23 04:21:31 +04:00
err = ext4_enable_quotas ( sb ) ;
2012-08-18 03:08:42 +04:00
if ( err )
2012-07-23 04:21:31 +04:00
goto restore_opts ;
}
}
2006-10-11 12:20:50 +04:00
# endif
2020-09-24 06:03:43 +03:00
if ( ! test_opt ( sb , BLOCK_VALIDITY ) & & sbi - > s_system_blks )
2020-07-28 16:04:37 +03:00
ext4_release_system_zone ( sb ) ;
2010-05-16 20:00:00 +04:00
2021-07-02 19:45:02 +03:00
if ( ! ext4_has_feature_mmp ( sb ) | | sb_rdonly ( sb ) )
ext4_stop_mmpd ( sbi ) ;
2006-10-11 12:20:50 +04:00
return 0 ;
2009-06-04 01:59:28 +04:00
2006-10-11 12:20:50 +04:00
restore_opts :
sb - > s_flags = old_sb_flags ;
sbi - > s_mount_opt = old_opts . s_mount_opt ;
2010-12-16 04:30:48 +03:00
sbi - > s_mount_opt2 = old_opts . s_mount_opt2 ;
2006-10-11 12:20:50 +04:00
sbi - > s_resuid = old_opts . s_resuid ;
sbi - > s_resgid = old_opts . s_resgid ;
sbi - > s_commit_interval = old_opts . s_commit_interval ;
2009-01-04 04:27:38 +03:00
sbi - > s_min_batch_time = old_opts . s_min_batch_time ;
sbi - > s_max_batch_time = old_opts . s_max_batch_time ;
2020-09-24 06:03:43 +03:00
if ( ! test_opt ( sb , BLOCK_VALIDITY ) & & sbi - > s_system_blks )
2020-07-28 16:04:37 +03:00
ext4_release_system_zone ( sb ) ;
2006-10-11 12:20:50 +04:00
# ifdef CONFIG_QUOTA
sbi - > s_jquota_fmt = old_opts . s_jquota_fmt ;
2014-09-11 19:15:15 +04:00
for ( i = 0 ; i < EXT4_MAXQUOTAS ; i + + ) {
2018-10-12 16:28:09 +03:00
to_free [ i ] = get_qf_name ( sb , sbi , i ) ;
rcu_assign_pointer ( sbi - > s_qf_names [ i ] , old_opts . s_qf_names [ i ] ) ;
2006-10-11 12:20:50 +04:00
}
2018-10-12 16:28:09 +03:00
synchronize_rcu ( ) ;
for ( i = 0 ; i < EXT4_MAXQUOTAS ; i + + )
kfree ( to_free [ i ] ) ;
2006-10-11 12:20:50 +04:00
# endif
2021-07-02 19:45:02 +03:00
if ( ! ext4_has_feature_mmp ( sb ) | | sb_rdonly ( sb ) )
ext4_stop_mmpd ( sbi ) ;
2006-10-11 12:20:50 +04:00
return err ;
}
2021-10-27 17:18:56 +03:00
static int ext4_reconfigure ( struct fs_context * fc )
2021-10-27 17:18:53 +03:00
{
2021-10-27 17:18:56 +03:00
struct super_block * sb = fc - > root - > d_sb ;
2021-10-27 17:18:53 +03:00
int ret ;
2021-10-27 17:18:56 +03:00
fc - > s_fs_info = EXT4_SB ( sb ) ;
2021-10-27 17:18:53 +03:00
2021-10-27 17:18:56 +03:00
ret = ext4_check_opt_consistency ( fc , sb ) ;
2021-10-27 17:18:53 +03:00
if ( ret < 0 )
2021-10-27 17:18:56 +03:00
return ret ;
2021-10-27 17:18:53 +03:00
2021-12-22 13:45:17 +03:00
ret = __ext4_remount ( fc , sb ) ;
2021-10-27 17:18:53 +03:00
if ( ret < 0 )
2021-10-27 17:18:56 +03:00
return ret ;
2021-10-27 17:18:53 +03:00
2021-10-27 17:18:56 +03:00
ext4_msg ( sb , KERN_INFO , " re-mounted. Quota mode: %s. " ,
ext4_quota_mode ( sb ) ) ;
2021-10-27 17:18:53 +03:00
return 0 ;
}
2016-01-09 00:01:22 +03:00
# ifdef CONFIG_QUOTA
static int ext4_statfs_project ( struct super_block * sb ,
kprojid_t projid , struct kstatfs * buf )
{
struct kqid qid ;
struct dquot * dquot ;
u64 limit ;
u64 curblock ;
qid = make_kqid_projid ( projid ) ;
dquot = dqget ( sb , qid ) ;
if ( IS_ERR ( dquot ) )
return PTR_ERR ( dquot ) ;
2017-08-07 14:19:50 +03:00
spin_lock ( & dquot - > dq_dqb_lock ) ;
2016-01-09 00:01:22 +03:00
2020-02-10 11:24:45 +03:00
limit = min_not_zero ( dquot - > dq_dqb . dqb_bsoftlimit ,
dquot - > dq_dqb . dqb_bhardlimit ) ;
2019-10-16 05:25:01 +03:00
limit > > = sb - > s_blocksize_bits ;
2016-01-09 00:01:22 +03:00
if ( limit & & buf - > f_blocks > limit ) {
2018-05-21 05:49:54 +03:00
curblock = ( dquot - > dq_dqb . dqb_curspace +
dquot - > dq_dqb . dqb_rsvspace ) > > sb - > s_blocksize_bits ;
2016-01-09 00:01:22 +03:00
buf - > f_blocks = limit ;
buf - > f_bfree = buf - > f_bavail =
( buf - > f_blocks > curblock ) ?
( buf - > f_blocks - curblock ) : 0 ;
}
2020-02-10 11:24:45 +03:00
limit = min_not_zero ( dquot - > dq_dqb . dqb_isoftlimit ,
dquot - > dq_dqb . dqb_ihardlimit ) ;
2016-01-09 00:01:22 +03:00
if ( limit & & buf - > f_files > limit ) {
buf - > f_files = limit ;
buf - > f_ffree =
( buf - > f_files > dquot - > dq_dqb . dqb_curinodes ) ?
( buf - > f_files - dquot - > dq_dqb . dqb_curinodes ) : 0 ;
}
2017-08-07 14:19:50 +03:00
spin_unlock ( & dquot - > dq_dqb_lock ) ;
2016-01-09 00:01:22 +03:00
dqput ( dquot ) ;
return 0 ;
}
# endif
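The clamping done by ext4_statfs_project() can be modelled in userspace. The following is a minimal sketch under stated assumptions, not the kernel code: struct demo_statfs, demo_min_not_zero() and demo_clamp_blocks() are hypothetical names, and the limits are assumed to be expressed in bytes, as the shift by s_blocksize_bits in the function above suggests.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the block fields of struct kstatfs. */
struct demo_statfs {
	uint64_t f_blocks;
	uint64_t f_bfree;
	uint64_t f_bavail;
};

/* Smaller of two values, treating zero as "no limit set". */
static uint64_t demo_min_not_zero(uint64_t a, uint64_t b)
{
	if (a == 0)
		return b;
	if (b == 0)
		return a;
	return a < b ? a : b;
}

/*
 * Clamp the reported size to the project limit, then derive free space
 * from the clamped size. Limits and usage are assumed to be in bytes.
 */
static void demo_clamp_blocks(struct demo_statfs *buf,
			      uint64_t soft_bytes, uint64_t hard_bytes,
			      uint64_t used_bytes, unsigned int blocksize_bits)
{
	uint64_t limit, curblock;

	assert(blocksize_bits < 64);
	limit = demo_min_not_zero(soft_bytes, hard_bytes) >> blocksize_bits;
	if (limit && buf->f_blocks > limit) {
		curblock = used_bytes >> blocksize_bits;
		buf->f_blocks = limit;
		/* Never report negative free space when usage exceeds the limit. */
		buf->f_bfree = buf->f_bavail =
			(buf->f_blocks > curblock) ?
			(buf->f_blocks - curblock) : 0;
	}
}
```

With no limit set (both values zero) the sketch leaves the buffer untouched, which matches the `limit &&` guard in the function above.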
2008-07-27 00:15:44 +04:00
static int ext4_statfs ( struct dentry * dentry , struct kstatfs * buf )
2006-10-11 12:20:50 +04:00
{
struct super_block * sb = dentry - > d_sb ;
2006-10-11 12:20:53 +04:00
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
struct ext4_super_block * es = sbi - > s_es ;
2013-04-10 06:11:22 +04:00
ext4_fsblk_t overhead = 0 , resv_blocks ;
2011-05-25 02:30:07 +04:00
s64 bfree ;
2013-04-10 06:11:22 +04:00
resv_blocks = EXT4_C2B ( sbi , atomic64_read ( & sbi - > s_resv_clusters ) ) ;
2006-10-11 12:20:50 +04:00
2012-07-10 00:27:05 +04:00
if ( ! test_opt ( sb , MINIX_DF ) )
overhead = sbi - > s_overhead ;
2006-10-11 12:20:50 +04:00
2006-10-11 12:20:53 +04:00
buf - > f_type = EXT4_SUPER_MAGIC ;
2006-10-11 12:20:50 +04:00
buf - > f_bsize = sb - > s_blocksize ;
2012-11-08 19:33:36 +04:00
buf - > f_blocks = ext4_blocks_count ( es ) - EXT4_C2B ( sbi , overhead ) ;
2011-09-10 02:56:51 +04:00
bfree = percpu_counter_sum_positive ( & sbi - > s_freeclusters_counter ) -
percpu_counter_sum_positive ( & sbi - > s_dirtyclusters_counter ) ;
2011-05-25 02:30:07 +04:00
/* prevent underflow in case little free space is available */
2011-09-10 02:56:51 +04:00
buf - > f_bfree = EXT4_C2B ( sbi , max_t ( s64 , bfree , 0 ) ) ;
2013-04-10 06:11:22 +04:00
buf - > f_bavail = buf - > f_bfree -
( ext4_r_blocks_count ( es ) + resv_blocks ) ;
if ( buf - > f_bfree < ( ext4_r_blocks_count ( es ) + resv_blocks ) )
2006-10-11 12:20:50 +04:00
buf - > f_bavail = 0 ;
buf - > f_files = le32_to_cpu ( es - > s_inodes_count ) ;
2007-10-17 10:25:44 +04:00
buf - > f_ffree = percpu_counter_sum_positive ( & sbi - > s_freeinodes_counter ) ;
2006-10-11 12:20:53 +04:00
buf - > f_namelen = EXT4_NAME_LEN ;
2021-03-22 20:39:43 +03:00
buf - > f_fsid = uuid_to_fsid ( es - > s_uuid ) ;
2009-06-04 01:59:28 +04:00
2016-01-09 00:01:22 +03:00
# ifdef CONFIG_QUOTA
if ( ext4_test_inode_flag ( dentry - > d_inode , EXT4_INODE_PROJINHERIT ) & &
sb_has_quota_limits_enabled ( sb , PRJQUOTA ) )
ext4_statfs_project ( sb , EXT4_I ( dentry - > d_inode ) - > i_projid , buf ) ;
# endif
2006-10-11 12:20:50 +04:00
return 0 ;
}
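The two underflow guards in ext4_statfs() can be illustrated in isolation. This is a sketch only: demo_free_blocks() and demo_avail_blocks() are hypothetical helpers, and the cluster-to-block conversion is assumed to be a left shift by the cluster-bits value.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Free blocks: (free - dirty) clusters, floored at zero, then scaled to
 * blocks. The signed intermediate is what makes the floor possible.
 */
static uint64_t demo_free_blocks(int64_t free_clusters,
				 int64_t dirty_clusters,
				 unsigned int cluster_bits)
{
	int64_t bfree = free_clusters - dirty_clusters;

	assert(cluster_bits < 32);
	if (bfree < 0)
		bfree = 0;
	return (uint64_t)bfree << cluster_bits;
}

/* Blocks available to unprivileged users: free minus reserved, never negative. */
static uint64_t demo_avail_blocks(uint64_t bfree, uint64_t reserved)
{
	return bfree < reserved ? 0 : bfree - reserved;
}
```

Both guards exist because the results are unsigned: without them a small deficit would be reported as an enormous amount of free space.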
# ifdef CONFIG_QUOTA
2017-06-08 15:39:48 +03:00
/*
* Helper functions so that transaction is started before we acquire dqio_sem
* to keep correct lock ordering of transaction > dqio_sem
*/
2006-10-11 12:20:50 +04:00
static inline struct inode * dquot_to_inode ( struct dquot * dquot )
{
2012-09-16 14:56:19 +04:00
return sb_dqopt ( dquot - > dq_sb ) - > files [ dquot - > dq_id . type ] ;
2006-10-11 12:20:50 +04:00
}
2006-10-11 12:20:53 +04:00
static int ext4_write_dquot ( struct dquot * dquot )
2006-10-11 12:20:50 +04:00
{
int ret , err ;
handle_t * handle ;
struct inode * inode ;
inode = dquot_to_inode ( dquot ) ;
2013-02-09 06:59:22 +04:00
handle = ext4_journal_start ( inode , EXT4_HT_QUOTA ,
2009-06-04 01:59:28 +04:00
EXT4_QUOTA_TRANS_BLOCKS ( dquot - > dq_sb ) ) ;
2006-10-11 12:20:50 +04:00
if ( IS_ERR ( handle ) )
return PTR_ERR ( handle ) ;
ret = dquot_commit ( dquot ) ;
2006-10-11 12:20:53 +04:00
err = ext4_journal_stop ( handle ) ;
2006-10-11 12:20:50 +04:00
if ( ! ret )
ret = err ;
return ret ;
}
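ext4_write_dquot() and the helpers that follow it share one error-handling shape: start a handle, run the quota operation, stop the handle, and report the first error seen. A minimal userspace sketch of that shape follows. struct demo_handle, demo_op_ok(), demo_op_fail() and demo_run_in_handle() are hypothetical, and a NULL handle stands in for a failed journal start.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical handle; stop_status models what stopping the handle returns. */
struct demo_handle {
	int stop_status;
};

static int demo_op_ok(void)
{
	return 0;
}

static int demo_op_fail(void)
{
	return -ENOSPC;
}

/* Run op inside the handle; the operation's error takes precedence. */
static int demo_run_in_handle(struct demo_handle *handle, int (*op)(void))
{
	int ret, err;

	if (handle == NULL)		/* models a failed start */
		return -EIO;
	assert(op != NULL);
	ret = op();
	err = handle->stop_status;	/* models stopping the handle */
	if (!ret)
		ret = err;
	return ret;
}
```

The handle is always stopped, even when the operation fails, so a started transaction is never left open.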
2006-10-11 12:20:53 +04:00
static int ext4_acquire_dquot ( struct dquot * dquot )
2006-10-11 12:20:50 +04:00
{
int ret , err ;
handle_t * handle ;
2013-02-09 06:59:22 +04:00
handle = ext4_journal_start ( dquot_to_inode ( dquot ) , EXT4_HT_QUOTA ,
2009-06-04 01:59:28 +04:00
EXT4_QUOTA_INIT_BLOCKS ( dquot - > dq_sb ) ) ;
2006-10-11 12:20:50 +04:00
if ( IS_ERR ( handle ) )
return PTR_ERR ( handle ) ;
ret = dquot_acquire ( dquot ) ;
2006-10-11 12:20:53 +04:00
err = ext4_journal_stop ( handle ) ;
2006-10-11 12:20:50 +04:00
if ( ! ret )
ret = err ;
return ret ;
}
2006-10-11 12:20:53 +04:00
static int ext4_release_dquot ( struct dquot * dquot )
2006-10-11 12:20:50 +04:00
{
int ret , err ;
handle_t * handle ;
2013-02-09 06:59:22 +04:00
handle = ext4_journal_start ( dquot_to_inode ( dquot ) , EXT4_HT_QUOTA ,
2009-06-04 01:59:28 +04:00
EXT4_QUOTA_DEL_BLOCKS ( dquot - > dq_sb ) ) ;
2007-09-12 02:23:29 +04:00
if ( IS_ERR ( handle ) ) {
/* Release dquot anyway to avoid endless cycle in dqput() */
dquot_release ( dquot ) ;
2006-10-11 12:20:50 +04:00
return PTR_ERR ( handle ) ;
2007-09-12 02:23:29 +04:00
}
2006-10-11 12:20:50 +04:00
ret = dquot_release ( dquot ) ;
2006-10-11 12:20:53 +04:00
err = ext4_journal_stop ( handle ) ;
2006-10-11 12:20:50 +04:00
if ( ! ret )
ret = err ;
return ret ;
}
2006-10-11 12:20:53 +04:00
static int ext4_mark_dquot_dirty ( struct dquot * dquot )
2006-10-11 12:20:50 +04:00
{
2013-03-03 02:57:08 +04:00
struct super_block * sb = dquot - > dq_sb ;
2020-10-22 06:20:59 +03:00
if ( ext4_is_quota_journalled ( sb ) ) {
2006-10-11 12:20:50 +04:00
dquot_mark_dquot_dirty ( dquot ) ;
2006-10-11 12:20:53 +04:00
return ext4_write_dquot ( dquot ) ;
2006-10-11 12:20:50 +04:00
} else {
return dquot_mark_dquot_dirty ( dquot ) ;
}
}
2006-10-11 12:20:53 +04:00
static int ext4_write_info ( struct super_block * sb , int type )
2006-10-11 12:20:50 +04:00
{
int ret , err ;
handle_t * handle ;
/* Data block + inode block */
2015-03-18 01:25:59 +03:00
handle = ext4_journal_start ( d_inode ( sb - > s_root ) , EXT4_HT_QUOTA , 2 ) ;
2006-10-11 12:20:50 +04:00
if ( IS_ERR ( handle ) )
return PTR_ERR ( handle ) ;
ret = dquot_commit_info ( sb , type ) ;
2006-10-11 12:20:53 +04:00
err = ext4_journal_stop ( handle ) ;
2006-10-11 12:20:50 +04:00
if ( ! ret )
ret = err ;
return ret ;
}
2016-04-01 08:31:28 +03:00
static void lockdep_set_quota_inode ( struct inode * inode , int subclass )
{
struct ext4_inode_info * ei = EXT4_I ( inode ) ;
/* The first argument of lockdep_set_subclass has to be
* * exactly * the same as the argument to init_rwsem ( ) - - - in
* this case , in init_once ( ) - - - or lockdep gets unhappy
* because the name of the lock is set using the
* stringification of the argument to init_rwsem ( ) .
*/
( void ) ei ; /* shut up clang warning if !CONFIG_LOCKDEP */
lockdep_set_subclass ( & ei - > i_data_sem , subclass ) ;
}
2006-10-11 12:20:50 +04:00
/*
* Standard function to be called on quota_on
*/
2006-10-11 12:20:53 +04:00
static int ext4_quota_on ( struct super_block * sb , int type , int format_id ,
2016-11-21 03:49:34 +03:00
const struct path * path )
2006-10-11 12:20:50 +04:00
{
int err ;
if ( ! test_opt ( sb , QUOTA ) )
return - EINVAL ;
2008-05-14 03:11:51 +04:00
2006-10-11 12:20:50 +04:00
/* Quotafile not on the same filesystem? */
2011-12-08 03:16:57 +04:00
if ( path - > dentry - > d_sb ! = sb )
2006-10-11 12:20:50 +04:00
return - EXDEV ;
2020-10-15 14:03:30 +03:00
/* Quota already enabled for this file? */
if ( IS_NOQUOTA ( d_inode ( path - > dentry ) ) )
return - EBUSY ;
2008-05-14 03:11:51 +04:00
/* Journaling quota? */
if ( EXT4_SB ( sb ) - > s_qf_names [ type ] ) {
2008-07-27 00:15:44 +04:00
/* Quotafile not in fs root? */
2010-09-15 19:38:58 +04:00
if ( path - > dentry - > d_parent ! = sb - > s_root )
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_WARNING ,
" Quota file not on filesystem root. "
" Journaled quota will not work " ) ;
2017-08-03 12:25:55 +03:00
sb_dqopt ( sb ) - > flags | = DQUOT_NOLIST_DIRTY ;
} else {
/*
* Clear the flag just in case mount options changed since
* last time .
*/
sb_dqopt ( sb ) - > flags & = ~ DQUOT_NOLIST_DIRTY ;
2008-07-27 00:15:44 +04:00
}
2008-05-14 03:11:51 +04:00
/*
* When we journal data on quota file , we have to flush journal to see
* all updates to the file when we bypass pagecache . . .
*/
2009-01-07 08:06:22 +03:00
if ( EXT4_SB ( sb ) - > s_journal & &
2015-03-18 01:25:59 +03:00
ext4_should_journal_data ( d_inode ( path - > dentry ) ) ) {
2008-05-14 03:11:51 +04:00
/*
* We don ' t need to lock updates but journal_flush ( ) could
* otherwise be livelocked . . .
*/
jbd2_journal_lock_updates ( EXT4_SB ( sb ) - > s_journal ) ;
2021-05-18 18:13:25 +03:00
err = jbd2_journal_flush ( EXT4_SB ( sb ) - > s_journal , 0 ) ;
2008-05-14 03:11:51 +04:00
jbd2_journal_unlock_updates ( EXT4_SB ( sb ) - > s_journal ) ;
2010-09-15 19:38:58 +04:00
if ( err )
2008-10-11 04:29:21 +04:00
return err ;
2008-05-14 03:11:51 +04:00
}
2017-04-06 16:40:06 +03:00
2016-04-01 08:31:28 +03:00
lockdep_set_quota_inode ( path - > dentry - > d_inode , I_DATA_SEM_QUOTA ) ;
err = dquot_quota_on ( sb , type , format_id , path ) ;
2021-10-07 18:53:35 +03:00
if ( ! err ) {
2017-04-06 16:40:06 +03:00
struct inode * inode = d_inode ( path - > dentry ) ;
handle_t * handle ;
2017-04-24 17:49:16 +03:00
/*
* Set inode flags to prevent userspace from messing with quota
* files . If this fails , we return success anyway since quotas
* are already enabled and this is not a hard failure .
*/
2017-04-06 16:40:06 +03:00
inode_lock ( inode ) ;
handle = ext4_journal_start ( inode , EXT4_HT_QUOTA , 1 ) ;
if ( IS_ERR ( handle ) )
goto unlock_inode ;
EXT4_I ( inode ) - > i_flags | = EXT4_NOATIME_FL | EXT4_IMMUTABLE_FL ;
inode_set_flags ( inode , S_NOATIME | S_IMMUTABLE ,
S_NOATIME | S_IMMUTABLE ) ;
2020-04-27 04:34:37 +03:00
err = ext4_mark_inode_dirty ( handle , inode ) ;
2017-04-06 16:40:06 +03:00
ext4_journal_stop ( handle ) ;
unlock_inode :
inode_unlock ( inode ) ;
2021-10-07 18:53:35 +03:00
if ( err )
dquot_quota_off ( sb , type ) ;
2017-04-06 16:40:06 +03:00
}
2021-10-07 18:53:35 +03:00
if ( err )
lockdep_set_quota_inode ( path - > dentry - > d_inode ,
I_DATA_SEM_NORMAL ) ;
2016-04-01 08:31:28 +03:00
return err ;
2006-10-11 12:20:50 +04:00
}
2012-07-23 04:21:31 +04:00
static int ext4_quota_enable ( struct super_block * sb , int type , int format_id ,
unsigned int flags )
{
int err ;
struct inode * qf_inode ;
2014-09-11 19:15:15 +04:00
unsigned long qf_inums [ EXT4_MAXQUOTAS ] = {
2012-07-23 04:21:31 +04:00
le32_to_cpu ( EXT4_SB ( sb ) - > s_es - > s_usr_quota_inum ) ,
2016-01-09 00:01:22 +03:00
le32_to_cpu ( EXT4_SB ( sb ) - > s_es - > s_grp_quota_inum ) ,
le32_to_cpu ( EXT4_SB ( sb ) - > s_es - > s_prj_quota_inum )
2012-07-23 04:21:31 +04:00
} ;
2015-10-17 23:18:43 +03:00
BUG_ON ( ! ext4_has_feature_quota ( sb ) ) ;
2012-07-23 04:21:31 +04:00
if ( ! qf_inums [ type ] )
return - EPERM ;
ext4: avoid declaring fs inconsistent due to invalid file handles
If we receive a file handle, either from NFS or open_by_handle_at(2),
and it points at an inode which has not been initialized, and the file
system has metadata checksums enabled, we shouldn't try to get the
inode, discover the checksum is invalid, and then declare the file
system as being inconsistent.
This can be reproduced by creating a test file system via "mke2fs -t
ext4 -O metadata_csum /tmp/foo.img 8M", mounting it, cd'ing into that
directory, and then running the following program.
#define _GNU_SOURCE
#include <fcntl.h>
struct handle {
struct file_handle fh;
unsigned char fid[MAX_HANDLE_SZ];
};
int main(int argc, char **argv)
{
struct handle h = {{8, 1 }, { 12, }};
open_by_handle_at(AT_FDCWD, &h.fh, O_RDONLY);
return 0;
}
Google-Bug-Id: 120690101
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2018-12-19 20:29:13 +03:00
qf_inode = ext4_iget ( sb , qf_inums [ type ] , EXT4_IGET_SPECIAL ) ;
2012-07-23 04:21:31 +04:00
if ( IS_ERR ( qf_inode ) ) {
ext4_error ( sb , " Bad quota inode # %lu " , qf_inums [ type ] ) ;
return PTR_ERR ( qf_inode ) ;
}
2013-04-09 17:21:41 +04:00
/* Don't account quota for quota files to avoid recursion */
qf_inode - > i_flags | = S_NOQUOTA ;
2016-04-01 08:31:28 +03:00
lockdep_set_quota_inode ( qf_inode , I_DATA_SEM_QUOTA ) ;
2019-11-01 20:55:38 +03:00
err = dquot_load_quota_inode ( qf_inode , type , format_id , flags ) ;
2016-04-01 08:31:28 +03:00
if ( err )
lockdep_set_quota_inode ( qf_inode , I_DATA_SEM_NORMAL ) ;
2018-12-04 07:28:02 +03:00
iput ( qf_inode ) ;
2012-07-23 04:21:31 +04:00
return err ;
}
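The lookup at the top of ext4_quota_enable() reduces to a per-type table indexed by quota type, where zero means "no inode configured". A small sketch, with hypothetical names (enum demo_qtype, demo_quota_inum()) and plain unsigned long values standing in for the superblock fields:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical quota types in the same order as the table above. */
enum demo_qtype {
	DEMO_USRQUOTA,
	DEMO_GRPQUOTA,
	DEMO_PRJQUOTA,
	DEMO_MAXQUOTAS
};

/*
 * Look up the quota inode number for a type. A zero entry means the
 * quota type has no hidden inode, which is reported as -EPERM.
 */
static int demo_quota_inum(const unsigned long inums[DEMO_MAXQUOTAS],
			   int type, unsigned long *out)
{
	assert(type >= 0 && type < DEMO_MAXQUOTAS);
	if (!inums[type])
		return -EPERM;
	*out = inums[type];
	return 0;
}
```

The output parameter is only written on success, so a caller can keep a previous value across a failed lookup.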
/* Enable usage tracking for all quota types. */
2021-08-16 12:57:05 +03:00
int ext4_enable_quotas ( struct super_block * sb )
2012-07-23 04:21:31 +04:00
{
int type , err = 0 ;
2014-09-11 19:15:15 +04:00
unsigned long qf_inums [ EXT4_MAXQUOTAS ] = {
2012-07-23 04:21:31 +04:00
le32_to_cpu ( EXT4_SB ( sb ) - > s_es - > s_usr_quota_inum ) ,
2016-01-09 00:01:22 +03:00
le32_to_cpu ( EXT4_SB ( sb ) - > s_es - > s_grp_quota_inum ) ,
le32_to_cpu ( EXT4_SB ( sb ) - > s_es - > s_prj_quota_inum )
2012-07-23 04:21:31 +04:00
} ;
2016-09-06 06:08:16 +03:00
bool quota_mopt [ EXT4_MAXQUOTAS ] = {
test_opt ( sb , USRQUOTA ) ,
test_opt ( sb , GRPQUOTA ) ,
test_opt ( sb , PRJQUOTA ) ,
} ;
2012-07-23 04:21:31 +04:00
2017-08-03 12:25:55 +03:00
sb_dqopt ( sb ) - > flags | = DQUOT_QUOTA_SYS_FILE | DQUOT_NOLIST_DIRTY ;
2014-09-11 19:15:15 +04:00
for ( type = 0 ; type < EXT4_MAXQUOTAS ; type + + ) {
2012-07-23 04:21:31 +04:00
if ( qf_inums [ type ] ) {
err = ext4_quota_enable ( sb , type , QFMT_VFS_V1 ,
2016-09-06 06:08:16 +03:00
DQUOT_USAGE_ENABLED |
( quota_mopt [ type ] ? DQUOT_LIMITS_ENABLED : 0 ) ) ;
2012-07-23 04:21:31 +04:00
if ( err ) {
ext4_warning ( sb ,
2013-01-25 08:24:54 +04:00
" Failed to enable quota tracking "
" (type=%d, err=%d). Please run "
" e2fsck to fix. " , type , err ) ;
2021-10-07 18:53:36 +03:00
for ( type - - ; type > = 0 ; type - - ) {
struct inode * inode ;
inode = sb_dqopt ( sb ) - > files [ type ] ;
if ( inode )
inode = igrab ( inode ) ;
2018-07-29 22:51:52 +03:00
dquot_quota_off ( sb , type ) ;
2021-10-07 18:53:36 +03:00
if ( inode ) {
lockdep_set_quota_inode ( inode ,
I_DATA_SEM_NORMAL ) ;
iput ( inode ) ;
}
}
2018-07-29 22:51:52 +03:00
2012-07-23 04:21:31 +04:00
return err ;
}
}
}
return 0 ;
}
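The enable loop above turns on each configured quota type in order and, if one fails, switches off the lower-numbered types before returning the error. A minimal userspace sketch of that enable-with-rollback shape follows. It is not the kernel code: `enable_all_quotas`, `demo_enable`, `demo_disable`, `enabled` and `fail_type` are hypothetical stand-ins, and `MAXQUOTAS` is assumed to be 3 (user, group, project).

```c
#include <errno.h>
#include <stdbool.h>

#define MAXQUOTAS 3 /* assumption: user, group, project */

/* Hypothetical stand-ins for the per-type enable and disable calls. */
typedef int (*enable_fn)(int type);
typedef void (*disable_fn)(int type);

/* Demo state so the behaviour can be observed from a test. */
static bool enabled[MAXQUOTAS];
static int fail_type = -1; /* type for which demo_enable() fails, -1 = none */

static int demo_enable(int type)
{
    if (type == fail_type)
        return -EIO;
    enabled[type] = true;
    return 0;
}

static void demo_disable(int type)
{
    enabled[type] = false;
}

/*
 * Enable every type whose inode number is non-zero. On the first failure,
 * call disable() for every lower-numbered type (in reverse order) and
 * return the error, so the caller never sees a half-enabled state.
 */
static int enable_all_quotas(const unsigned long inums[MAXQUOTAS],
                             enable_fn enable, disable_fn disable)
{
    for (int type = 0; type < MAXQUOTAS; type++) {
        if (!inums[type])
            continue;
        int err = enable(type);
        if (err) {
            for (type--; type >= 0; type--)
                disable(type);
            return err;
        }
    }
    return 0;
}
```

Walking the rollback loop downwards from the failing index keeps the cleanup symmetric with the setup, which is the same idea the kernel loop expresses with `for (type--; type >= 0; type--)`.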
2010-08-02 01:48:36 +04:00
static int ext4_quota_off ( struct super_block * sb , int type )
{
2011-04-04 23:33:39 +04:00
struct inode * inode = sb_dqopt ( sb ) - > files [ type ] ;
handle_t * handle ;
2017-04-06 16:40:06 +03:00
int err ;
2011-04-04 23:33:39 +04:00
2010-11-08 21:47:33 +03:00
/* Force all delayed allocation blocks to be allocated.
* Caller already holds s_umount sem */
if ( test_opt ( sb , DELALLOC ) )
2010-08-02 01:48:36 +04:00
sync_filesystem ( sb ) ;
2017-04-06 16:40:06 +03:00
if ( ! inode | | ! igrab ( inode ) )
2011-05-16 17:59:13 +04:00
goto out ;
2017-04-06 16:40:06 +03:00
err = dquot_quota_off ( sb , type ) ;
2017-05-22 05:31:23 +03:00
if ( err | | ext4_has_feature_quota ( sb ) )
2017-04-06 16:40:06 +03:00
goto out_put ;
inode_lock ( inode ) ;
2017-04-24 17:49:16 +03:00
/*
* Update modification times of quota files when userspace can
* start looking at them . If we fail , we return success anyway since
* this is not a hard failure and quotas are already disabled .
*/
2013-02-09 06:59:22 +04:00
handle = ext4_journal_start ( inode , EXT4_HT_QUOTA , 1 ) ;
2020-04-27 04:34:37 +03:00
if ( IS_ERR ( handle ) ) {
err = PTR_ERR ( handle ) ;
2017-04-06 16:40:06 +03:00
goto out_unlock ;
2020-04-27 04:34:37 +03:00
}
2017-04-06 16:40:06 +03:00
EXT4_I ( inode ) - > i_flags & = ~ ( EXT4_NOATIME_FL | EXT4_IMMUTABLE_FL ) ;
inode_set_flags ( inode , 0 , S_NOATIME | S_IMMUTABLE ) ;
2016-11-15 05:40:10 +03:00
inode - > i_mtime = inode - > i_ctime = current_time ( inode ) ;
2020-04-27 04:34:37 +03:00
err = ext4_mark_inode_dirty ( handle , inode ) ;
2011-04-04 23:33:39 +04:00
ext4_journal_stop ( handle ) ;
2017-04-06 16:40:06 +03:00
out_unlock :
inode_unlock ( inode ) ;
out_put :
2017-05-22 05:31:23 +03:00
lockdep_set_quota_inode ( inode , I_DATA_SEM_NORMAL ) ;
2017-04-06 16:40:06 +03:00
iput ( inode ) ;
return err ;
2011-04-04 23:33:39 +04:00
out :
2010-08-02 01:48:36 +04:00
return dquot_quota_off ( sb , type ) ;
}
2006-10-11 12:20:50 +04:00
/* Read data from quotafile - avoid pagecache and such because we cannot afford
* acquiring the locks . . . As quota files are never truncated and quota code
2011-03-31 05:57:33 +04:00
* itself serializes the operations ( and no one else should touch the files )
2006-10-11 12:20:50 +04:00
* we don ' t have to be afraid of races */
2006-10-11 12:20:53 +04:00
static ssize_t ext4_quota_read ( struct super_block * sb , int type , char * data ,
2006-10-11 12:20:50 +04:00
size_t len , loff_t off )
{
struct inode * inode = sb_dqopt ( sb ) - > files [ type ] ;
2008-01-29 07:58:27 +03:00
ext4_lblk_t blk = off > > EXT4_BLOCK_SIZE_BITS ( sb ) ;
2006-10-11 12:20:50 +04:00
int offset = off & ( sb - > s_blocksize - 1 ) ;
int tocopy ;
size_t toread ;
struct buffer_head * bh ;
loff_t i_size = i_size_read ( inode ) ;
if ( off > i_size )
return 0 ;
if ( off + len > i_size )
len = i_size - off ;
toread = len ;
while ( toread > 0 ) {
tocopy = sb - > s_blocksize - offset < toread ?
sb - > s_blocksize - offset : toread ;
2014-08-30 04:52:15 +04:00
bh = ext4_bread ( NULL , inode , blk , 0 ) ;
if ( IS_ERR ( bh ) )
return PTR_ERR ( bh ) ;
2006-10-11 12:20:50 +04:00
if ( ! bh ) /* A hole? */
memset ( data , 0 , tocopy ) ;
else
memcpy ( data , bh - > b_data + offset , tocopy ) ;
brelse ( bh ) ;
offset = 0 ;
toread - = tocopy ;
data + = tocopy ;
blk + + ;
}
return len ;
}
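The read path above clamps the request to the file size, then copies block by block, treating a missing block as a hole that reads back as zeros. The following userspace sketch reproduces only that arithmetic, assuming the file is modelled as an array of block pointers where NULL marks a hole. `read_blocks` and its parameters are illustrative names, not kernel interfaces.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy up to len bytes starting at byte offset off from a file stored as
 * fixed-size blocks. A NULL entry in blocks[] stands for a hole and is
 * returned as zeros. The request is clamped to file_size; the return value
 * is the number of bytes actually copied.
 */
static size_t read_blocks(const char *const blocks[], size_t block_size,
                          size_t file_size, char *data, size_t len, size_t off)
{
    if (off > file_size)
        return 0;
    if (off + len > file_size)
        len = file_size - off;

    size_t blk = off / block_size;    /* first block touched */
    size_t offset = off % block_size; /* offset inside that block */
    size_t toread = len;

    while (toread > 0) {
        /* Never copy past the end of the current block. */
        size_t tocopy = block_size - offset < toread ?
                        block_size - offset : toread;
        if (!blocks[blk])
            memset(data, 0, tocopy); /* hole */
        else
            memcpy(data, blocks[blk] + offset, tocopy);
        offset = 0; /* later blocks are read from their start */
        toread -= tocopy;
        data += tocopy;
        blk++;
    }
    return len;
}
```

Only the first block can start at a non-zero offset, which is why `offset` is reset after the first iteration.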
/* Write to quotafile (we know the transaction is already started and has
* enough credits ) */
2006-10-11 12:20:53 +04:00
static ssize_t ext4_quota_write ( struct super_block * sb , int type ,
2006-10-11 12:20:50 +04:00
const char * data , size_t len , loff_t off )
{
struct inode * inode = sb_dqopt ( sb ) - > files [ type ] ;
2008-01-29 07:58:27 +03:00
ext4_lblk_t blk = off > > EXT4_BLOCK_SIZE_BITS ( sb ) ;
2020-04-27 04:34:37 +03:00
int err = 0 , err2 = 0 , offset = off & ( sb - > s_blocksize - 1 ) ;
2015-06-21 08:25:29 +03:00
int retries = 0 ;
2006-10-11 12:20:50 +04:00
struct buffer_head * bh ;
handle_t * handle = journal_current_handle ( ) ;
2021-12-23 04:55:06 +03:00
if ( ! handle ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_WARNING , " Quota write (off=%llu, len=%llu) "
" cancelled because transaction is not started " ,
2007-09-12 02:23:29 +04:00
( unsigned long long ) off , ( unsigned long long ) len ) ;
return - EIO ;
}
2010-03-02 16:08:51 +03:00
/*
* Since we account only one data block in transaction credits ,
* then it is impossible to cross a block boundary .
*/
if ( sb - > s_blocksize - offset < len ) {
ext4_msg ( sb , KERN_WARNING , " Quota write (off=%llu, len=%llu) "
" cancelled because not block aligned " ,
( unsigned long long ) off , ( unsigned long long ) len ) ;
return - EIO ;
}
2015-06-21 08:25:29 +03:00
do {
bh = ext4_bread ( handle , inode , blk ,
EXT4_GET_BLOCKS_CREATE |
EXT4_GET_BLOCKS_METADATA_NOFAIL ) ;
2020-02-04 04:37:45 +03:00
} while ( PTR_ERR ( bh ) = = - ENOSPC & &
2015-06-21 08:25:29 +03:00
ext4_should_retry_alloc ( inode - > i_sb , & retries ) ) ;
2014-08-30 04:52:15 +04:00
if ( IS_ERR ( bh ) )
return PTR_ERR ( bh ) ;
2010-03-02 16:08:51 +03:00
if ( ! bh )
goto out ;
2014-05-13 06:06:43 +04:00
BUFFER_TRACE ( bh , " get write access " ) ;
2021-08-16 12:57:04 +03:00
err = ext4_journal_get_write_access ( handle , sb , bh , EXT4_JTR_NONE ) ;
2010-07-27 19:56:07 +04:00
if ( err ) {
brelse ( bh ) ;
2014-08-30 04:52:15 +04:00
return err ;
2006-10-11 12:20:50 +04:00
}
2010-03-02 16:08:51 +03:00
lock_buffer ( bh ) ;
memcpy ( bh - > b_data + offset , data , len ) ;
flush_dcache_page ( bh - > b_page ) ;
unlock_buffer ( bh ) ;
2010-07-27 19:56:07 +04:00
err = ext4_handle_dirty_metadata ( handle , NULL , bh ) ;
2010-03-02 16:08:51 +03:00
brelse ( bh ) ;
2006-10-11 12:20:50 +04:00
out :
2010-03-02 16:08:51 +03:00
if ( inode - > i_size < off + len ) {
i_size_write ( inode , off + len ) ;
2006-10-11 12:20:53 +04:00
EXT4_I ( inode ) - > i_disksize = inode - > i_size ;
2020-04-27 04:34:37 +03:00
err2 = ext4_mark_inode_dirty ( handle , inode ) ;
if ( unlikely ( err2 & & ! err ) )
err = err2 ;
2006-10-11 12:20:50 +04:00
}
2020-04-27 04:34:37 +03:00
return err ? err : len ;
2006-10-11 12:20:50 +04:00
}
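The write path above refuses any request that would cross a block boundary, because only one data block is accounted for in the transaction credits. The guard `sb->s_blocksize - offset < len` can be exercised in isolation. This is a sketch under the assumption that the block size is a power of two; `fits_in_one_block` is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true when the byte range [off, off + len) lies inside a single
 * block. block_size must be a power of two so that masking with
 * (block_size - 1) yields the offset within the block.
 */
static bool fits_in_one_block(size_t block_size, unsigned long long off,
                              size_t len)
{
    size_t offset = (size_t)(off & (block_size - 1));

    return block_size - offset >= len;
}
```

For example, with 4096-byte blocks a 96-byte write at offset 4000 fits, while a 97-byte write at the same offset would spill into the next block and be rejected.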
# endif
2015-06-18 17:52:29 +03:00
# if !defined(CONFIG_EXT2_FS) && !defined(CONFIG_EXT2_FS_MODULE) && defined(CONFIG_EXT4_USE_FOR_EXT2)
2009-12-07 22:08:51 +03:00
static inline void register_as_ext2 ( void )
{
int err = register_filesystem ( & ext2_fs_type ) ;
if ( err )
printk ( KERN_WARNING
" EXT4-fs: Unable to register as ext2 (%d) \n " , err ) ;
}
static inline void unregister_as_ext2 ( void )
{
unregister_filesystem ( & ext2_fs_type ) ;
}
2011-04-19 01:29:14 +04:00
static inline int ext2_feature_set_ok ( struct super_block * sb )
{
2015-10-17 23:18:43 +03:00
if ( ext4_has_unknown_ext2_incompat_features ( sb ) )
2011-04-19 01:29:14 +04:00
return 0 ;
2017-07-17 10:45:34 +03:00
if ( sb_rdonly ( sb ) )
2011-04-19 01:29:14 +04:00
return 1 ;
2015-10-17 23:18:43 +03:00
if ( ext4_has_unknown_ext2_ro_compat_features ( sb ) )
2011-04-19 01:29:14 +04:00
return 0 ;
return 1 ;
}
2009-12-07 22:08:51 +03:00
# else
static inline void register_as_ext2 ( void ) { }
static inline void unregister_as_ext2 ( void ) { }
2011-04-19 01:29:14 +04:00
static inline int ext2_feature_set_ok ( struct super_block * sb ) { return 0 ; }
2009-12-07 22:08:51 +03:00
# endif
static inline void register_as_ext3 ( void )
{
int err = register_filesystem ( & ext3_fs_type ) ;
if ( err )
printk ( KERN_WARNING
" EXT4-fs: Unable to register as ext3 (%d) \n " , err ) ;
}
static inline void unregister_as_ext3 ( void )
{
unregister_filesystem ( & ext3_fs_type ) ;
}
2011-04-19 01:29:14 +04:00
static inline int ext3_feature_set_ok ( struct super_block * sb )
{
2015-10-17 23:18:43 +03:00
if ( ext4_has_unknown_ext3_incompat_features ( sb ) )
2011-04-19 01:29:14 +04:00
return 0 ;
2015-10-17 23:18:43 +03:00
if ( ! ext4_has_feature_journal ( sb ) )
2011-04-19 01:29:14 +04:00
return 0 ;
2017-07-17 10:45:34 +03:00
if ( sb_rdonly ( sb ) )
2011-04-19 01:29:14 +04:00
return 1 ;
2015-10-17 23:18:43 +03:00
if ( ext4_has_unknown_ext3_ro_compat_features ( sb ) )
2011-04-19 01:29:14 +04:00
return 0 ;
return 1 ;
}
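Both `ext2_feature_set_ok()` and `ext3_feature_set_ok()` follow the same rule: an unknown incompat feature always blocks the mount, while an unknown ro_compat feature only blocks a read-write mount. A bitmask sketch of that rule follows. The struct, the function name and the bit values used with it are illustrative assumptions, and the extra ext3 requirement that a journal be present is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the two feature words that matter here. */
struct feature_sets {
    uint32_t incompat;  /* features a driver must understand to mount at all */
    uint32_t ro_compat; /* features a driver must understand to mount read-write */
};

/*
 * Decide whether a driver that understands `supported` may mount a file
 * system advertising `on_disk`.
 */
static bool feature_set_ok(struct feature_sets on_disk,
                           struct feature_sets supported, bool read_only)
{
    if (on_disk.incompat & ~supported.incompat)
        return false; /* unknown incompat bit: never mountable */
    if (read_only)
        return true;  /* ro_compat bits do not matter for read-only mounts */
    if (on_disk.ro_compat & ~supported.ro_compat)
        return false; /* unknown ro_compat bit: read-write is unsafe */
    return true;
}
```

Checking the incompat word first matters: a read-only mount short-circuits the ro_compat check but must never bypass the incompat one.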
2009-12-07 22:08:51 +03:00
2008-10-11 04:02:48 +04:00
static struct file_system_type ext4_fs_type = {
2021-10-27 17:18:56 +03:00
. owner = THIS_MODULE ,
. name = " ext4 " ,
. init_fs_context = ext4_init_fs_context ,
. parameters = ext4_param_specs ,
. kill_sb = kill_block_super ,
. fs_flags = FS_REQUIRES_DEV | FS_ALLOW_IDMAP ,
2008-10-11 04:02:48 +04:00
} ;
2013-03-03 07:39:14 +04:00
MODULE_ALIAS_FS ( " ext4 " ) ;
2008-10-11 04:02:48 +04:00
ext4: serialize unaligned asynchronous DIO
ext4 has a data corruption case when doing non-block-aligned
asynchronous direct IO into a sparse file, as demonstrated
by xfstest 240.
The root cause is that while ext4 preallocates space in the
hole, mappings of that space still look "new" and
dio_zero_block() will zero out the unwritten portions. When
more than one AIO thread is going, they both find this "new"
block and race to zero out their portion; this is uncoordinated
and causes data corruption.
Dave Chinner fixed this for xfs by simply serializing all
unaligned asynchronous direct IO. I've done the same here.
The difference is that we only wait on conversions, not all IO.
This is a very big hammer, and I'm not very pleased with
stuffing this into ext4_file_write(). But since ext4 is
DIO_LOCKING, we need to serialize it at this high level.
I tried to move this into ext4_ext_direct_IO, but by then
we have the i_mutex already, and we will wait on the
work queue to do conversions - which must also take the
i_mutex. So that won't work.
This was originally exposed by qemu-kvm installing to
a raw disk image with a normal sector-63 alignment. I've
tested a backport of this patch with qemu, and it does
avoid the corruption. It is also quite a lot slower
(14 min for package installs, vs. 8 min for well-aligned)
but I'll take slow correctness over fast corruption any day.
Mingming suggested that we can track outstanding
conversions, and wait on those so that non-sparse
files won't be affected, and I've implemented that here;
unaligned AIO to nonsparse files won't take a perf hit.
[tytso@mit.edu: Keep the mutex as a hashed array instead
of bloating the ext4 inode]
[tytso@mit.edu: Fix up namespace issues so that global
variables are protected with an "ext4_" prefix.]
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-12 16:17:34 +03:00
/* Shared across all ext4 file systems */
wait_queue_head_t ext4__ioend_wq [ EXT4_WQ_HASH_SZ ] ;
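The commit message above notes that the wait queues are kept as a hashed array shared by all ext4 file systems rather than one per inode. One plausible way to pick a slot is to reduce the object's address modulo the table size. The sketch below assumes that scheme and a table size of 37; `WQ_HASH_SZ`, `struct wait_slot`, `slots` and `slot_for` are illustrative names, not the kernel's definitions.

```c
#include <stdint.h>

#define WQ_HASH_SZ 37 /* assumed table size, chosen here for illustration */

/* Stand-in for a wait queue head. */
struct wait_slot {
    int waiters;
};

/* One table shared by every object, instead of one slot per object. */
static struct wait_slot slots[WQ_HASH_SZ];

/* Map an object address to its slot; unrelated objects may share a slot. */
static struct wait_slot *slot_for(const void *obj)
{
    return &slots[(uintptr_t)obj % WQ_HASH_SZ];
}
```

The trade-off is memory against spurious wakeups: two unrelated objects that hash to the same slot will wake each other's waiters, so waiters must recheck their own condition after waking.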
2010-10-28 05:30:14 +04:00
static int __init ext4_init_fs ( void )
2006-10-11 12:20:50 +04:00
{
2011-02-12 16:17:34 +03:00
int i , err ;
2008-01-29 08:19:52 +03:00
2015-08-15 21:59:44 +03:00
ratelimit_state_init ( & ext4_mount_msg_ratelimit , 30 * HZ , 64 ) ;
2012-03-21 06:05:02 +04:00
ext4_li_info = NULL ;
ext4: ensure Inode flags consistency are checked at build time
Flags being used by atomic operations in inode flags (e.g.
ext4_test_inode_flag(), should be consistent with that actually stored
in inodes, i.e.: EXT4_XXX_FL.
It ensures that this consistency is checked at build-time, not at
run-time.
Currently, the flags consistency are being checked at run-time, but,
there is no real reason to not do a build-time check instead of a
run-time check. The code is comparing macro defined values with enum
type variables, where both are constants, so, there is no problem in
comparing constants at build-time.
enum variables are treated as constants by the C compiler, according
to the C99 specs (see www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf
sec. 6.2.5, item 16), so, there is no real problem in comparing an
enumeration type at build time
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-11 01:30:45 +04:00
/* Build-time check for flags consistency */
2010-05-17 06:00:00 +04:00
ext4_check_flag_values ( ) ;
2011-02-12 16:17:34 +03:00
2016-03-09 06:44:50 +03:00
for ( i = 0 ; i < EXT4_WQ_HASH_SZ ; i + + )
2011-02-12 16:17:34 +03:00
init_waitqueue_head ( & ext4__ioend_wq [ i ] ) ;
2012-11-09 06:57:32 +04:00
err = ext4_init_es ( ) ;
2009-05-17 23:38:01 +04:00
if ( err )
return err ;
2012-11-09 06:57:32 +04:00
2018-10-01 21:17:41 +03:00
err = ext4_init_pending ( ) ;
2019-07-22 19:26:24 +03:00
if ( err )
goto out7 ;
err = ext4_init_post_read_processing ( ) ;
2018-10-01 21:17:41 +03:00
if ( err )
goto out6 ;
2012-11-09 06:57:32 +04:00
err = ext4_init_pageio ( ) ;
if ( err )
2015-09-23 19:44:17 +03:00
goto out5 ;
2012-11-09 06:57:32 +04:00
2010-10-28 05:30:14 +04:00
err = ext4_init_system_zone ( ) ;
2010-10-28 05:30:10 +04:00
if ( err )
2015-09-23 19:44:17 +03:00
goto out4 ;
2010-10-28 05:30:05 +04:00
2015-09-23 19:44:17 +03:00
err = ext4_init_sysfs ( ) ;
2011-02-03 22:33:49 +03:00
if ( err )
2015-09-23 19:44:17 +03:00
goto out3 ;
2010-10-28 05:30:05 +04:00
2010-10-28 05:30:14 +04:00
err = ext4_init_mballoc ( ) ;
2008-01-29 08:19:52 +03:00
if ( err )
goto out2 ;
2006-10-11 12:20:50 +04:00
err = init_inodecache ( ) ;
if ( err )
goto out1 ;
2020-10-15 23:37:57 +03:00
err = ext4_fc_init_dentry_cache ( ) ;
if ( err )
goto out05 ;
2009-12-07 22:08:51 +03:00
register_as_ext3 ( ) ;
2011-04-19 01:29:14 +04:00
register_as_ext2 ( ) ;
2008-10-11 04:02:48 +04:00
err = register_filesystem ( & ext4_fs_type ) ;
2006-10-11 12:20:50 +04:00
if ( err )
goto out ;
ext4: add support for lazy inode table initialization
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out. The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, have not been corrupted. However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the old, uninitialized data could potentially cause e2fsck to
report false problems.
Hence, it is important for the inode tables to be initialized as soon
as possible. This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.
This is done via a new kernel thread called ext4lazyinit, which is
created on demand and destroyed when it is no longer needed. There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with the init_itable mount option is mounted, the ext4lazyinit
thread is created, then the filesystem can register its request in the
request list.
This thread then walks through the list of requests picking up
scheduled requests and invoking ext4_init_inode_table(). Next schedule
time for the request is computed by multiplying the time it took to
zero out last inode table with wait multiplier, which can be set with
the (init_itable=n) mount option (default is 10). We are doing
this so we do not take the whole I/O bandwidth. When the thread is no
longer necessary (request list is empty) it frees the appropriate
structures and exits (and can be created later by another
filesystem).
We do not disturb regular inode allocations in any way; they simply do not
care whether the inode table is, or is not zeroed. But when zeroing, we
have to skip used inodes, obviously. Also we should prevent new inode
allocations from the group, while zeroing is on the way. For that we
take write alloc_sem lock in ext4_init_inode_table() and read alloc_sem
in the ext4_claim_inode, so when we are unlucky and allocator hits the
group which is currently being zeroed, it just has to wait.
This can be suppressed using the mount option no_init_itable.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28 05:30:05 +04:00
2006-10-11 12:20:50 +04:00
return 0 ;
out :
2009-12-07 22:08:51 +03:00
unregister_as_ext2 ( ) ;
unregister_as_ext3 ( ) ;
2021-12-23 19:44:36 +03:00
ext4_fc_destroy_dentry_cache ( ) ;
2020-10-15 23:37:57 +03:00
out05 :
2006-10-11 12:20:50 +04:00
destroy_inodecache ( ) ;
out1 :
2010-10-28 05:30:14 +04:00
ext4_exit_mballoc ( ) ;
2014-03-19 03:24:49 +04:00
out2 :
2015-09-23 19:44:17 +03:00
ext4_exit_sysfs ( ) ;
out3 :
2010-10-28 05:30:14 +04:00
ext4_exit_system_zone ( ) ;
2015-09-23 19:44:17 +03:00
out4 :
2010-10-28 05:30:14 +04:00
ext4_exit_pageio ( ) ;
2015-09-23 19:44:17 +03:00
out5 :
2019-07-22 19:26:24 +03:00
ext4_exit_post_read_processing ( ) ;
2018-10-01 21:17:41 +03:00
out6 :
2019-07-22 19:26:24 +03:00
ext4_exit_pending ( ) ;
out7 :
2012-11-09 06:57:32 +04:00
ext4_exit_es ( ) ;
2006-10-11 12:20:50 +04:00
return err ;
}
2010-10-28 05:30:14 +04:00
static void __exit ext4_exit_fs ( void )
2006-10-11 12:20:50 +04:00
{
2010-10-28 05:30:05 +04:00
ext4_destroy_lazyinit_thread ( ) ;
2009-12-07 22:08:51 +03:00
unregister_as_ext2 ( ) ;
unregister_as_ext3 ( ) ;
2008-10-11 04:02:48 +04:00
unregister_filesystem ( & ext4_fs_type ) ;
2021-12-23 19:44:36 +03:00
ext4_fc_destroy_dentry_cache ( ) ;
2006-10-11 12:20:50 +04:00
destroy_inodecache ( ) ;
2010-10-28 05:30:14 +04:00
ext4_exit_mballoc ( ) ;
2015-09-23 19:44:17 +03:00
ext4_exit_sysfs ( ) ;
2010-10-28 05:30:14 +04:00
ext4_exit_system_zone ( ) ;
ext4_exit_pageio ( ) ;
2019-07-22 19:26:24 +03:00
ext4_exit_post_read_processing ( ) ;
2013-07-26 23:21:11 +04:00
ext4_exit_es ( ) ;
2018-10-01 21:17:41 +03:00
ext4_exit_pending ( ) ;
2006-10-11 12:20:50 +04:00
}
MODULE_AUTHOR ( " Remy Card, Stephen Tweedie, Andrew Morton, Andreas Dilger, Theodore Ts'o and others " ) ;
2009-01-06 22:53:16 +03:00
MODULE_DESCRIPTION ( " Fourth Extended Filesystem " ) ;
2006-10-11 12:20:50 +04:00
MODULE_LICENSE ( " GPL " ) ;
2018-04-26 07:44:46 +03:00
MODULE_SOFTDEP ( " pre: crc32c " ) ;
2010-10-28 05:30:14 +04:00
module_init ( ext4_init_fs )
module_exit ( ext4_exit_fs )