/*
 *  linux/fs/ext4/super.c
 *
 * Copyright (C) 1992, 1993, 1994, 1995
 * Remy Card (card@masi.ibp.fr)
 * Laboratoire MASI - Institut Blaise Pascal
 * Universite Pierre et Marie Curie (Paris VI)
 *
 *  from
 *
 *  linux/fs/minix/inode.c
 *
 *  Copyright (C) 1991, 1992  Linus Torvalds
 *
 *  Big-endian to little-endian byte-swapping/bitmaps by
 *        David S. Miller (davem@caip.rutgers.edu), 1995
 */

#include <linux/module.h>
#include <linux/string.h>
#include <linux/fs.h>
#include <linux/time.h>
#include <linux/vmalloc.h>
#include <linux/slab.h>
#include <linux/init.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/parser.h>
#include <linux/buffer_head.h>
#include <linux/exportfs.h>
#include <linux/vfs.h>
#include <linux/random.h>
#include <linux/mount.h>
#include <linux/namei.h>
#include <linux/quotaops.h>
#include <linux/seq_file.h>
#include <linux/ctype.h>
#include <linux/log2.h>
Ext4: Uninitialized Block Groups

In pass1 of e2fsck, every inode table in the filesystem is scanned and checked,
regardless of whether it is in use. This is the most time-consuming part
of the filesystem check. The uninitialized block group feature can greatly
reduce e2fsck time by eliminating checking of uninitialized inodes.

With this feature, there is a high water mark of used inodes for each block
group. Block and inode bitmaps can be uninitialized on disk via a flag in the
group descriptor to avoid reading or scanning them at e2fsck time. A checksum
of each group descriptor is used to ensure that corruption in the group
descriptor's bit flags does not cause incorrect operation.

The feature is enabled through a mkfs option:

mke2fs /dev/ -O uninit_groups

A patch adding support for uninitialized block groups to e2fsprogs tools has
been posted to the linux-ext4 mailing list.

The patches have been stress tested with fsstress and fsx. In performance
tests of e2fsck time, we have seen that e2fsck time on ext3 grows
linearly with the total number of inodes in the filesystem. In ext4 with the
uninitialized block groups feature, the e2fsck time depends
solely on the number of used inodes rather than the total inode count.
Since typical ext4 filesystems only use 1-10% of their inodes, this feature can
greatly reduce e2fsck time for users, with a performance improvement of 2-20
times, depending on how full the filesystem is.

The attached graph shows the major improvements in e2fsck times in filesystems
with a large total inode count, but few inodes in use.

In each group descriptor, if we have

EXT4_BG_INODE_UNINIT set in bg_flags:
The inode table is not initialized/used in this group, so we can skip
the consistency check during fsck.

EXT4_BG_BLOCK_UNINIT set in bg_flags:
No block in the group is used, so we can skip the block bitmap
verification for this group.

We also add two new fields to the group descriptor as a part of the
uninitialized group patch.

__le16 bg_itable_unused; /* Unused inodes count */
__le16 bg_checksum; /* crc16(sb_uuid+group+desc) */

bg_itable_unused:
If EXT4_BG_INODE_UNINIT is not set in bg_flags,
then bg_itable_unused gives the offset within
the inode table up to which the inodes are used. This can be
used by fsck to skip the list of inodes that are marked unused.

bg_checksum:
Now that we depend on bg_flags and bg_itable_unused to determine
the block and inode usage, we need to make sure the group descriptor
is not corrupt. We add a checksum to the group descriptor to
detect corruption. If the descriptor is found to be corrupt, we
mark all the blocks and inodes in the group used.

Signed-off-by: Avantika Mathur <mathur@us.ibm.com>
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
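
The decision rules in the commit message above can be sketched in a few lines of user-space C. This is only an illustration under stated assumptions, not the e2fsck or kernel implementation: `struct demo_group_desc`, `demo_inodes_to_scan()` and `demo_must_verify_block_bitmap()` are hypothetical names, the `checksum_ok` field stands in for a real crc16 verification, and the two `DEMO_BG_*` values are meant to mirror the kernel's `EXT4_BG_INODE_UNINIT` and `EXT4_BG_BLOCK_UNINIT` flags. The sketch also reads `bg_itable_unused` as a count of unused inodes at the end of the table, following the field's own comment.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror EXT4_BG_INODE_UNINIT / EXT4_BG_BLOCK_UNINIT. */
#define DEMO_BG_INODE_UNINIT 0x0001
#define DEMO_BG_BLOCK_UNINIT 0x0002

/* Hypothetical, simplified view of the fields the message talks about. */
struct demo_group_desc {
	uint16_t flags;         /* bg_flags */
	uint16_t itable_unused; /* bg_itable_unused: unused inodes at the end */
	int checksum_ok;        /* stand-in for a verified bg_checksum */
};

/* How many inodes of this group a checker would still have to scan. */
static uint32_t demo_inodes_to_scan(const struct demo_group_desc *gd,
				    uint32_t inodes_per_group)
{
	/* Corrupt descriptor: trust nothing, treat every inode as used. */
	if (!gd->checksum_ok)
		return inodes_per_group;
	/* Inode table never initialized: nothing to scan. */
	if (gd->flags & DEMO_BG_INODE_UNINIT)
		return 0;
	/* Guard against a nonsensical count before subtracting. */
	if (gd->itable_unused > inodes_per_group)
		return inodes_per_group;
	return inodes_per_group - gd->itable_unused;
}

/* Whether the block bitmap of this group needs to be verified. */
static int demo_must_verify_block_bitmap(const struct demo_group_desc *gd)
{
	if (!gd->checksum_ok)
		return 1;
	return !(gd->flags & DEMO_BG_BLOCK_UNINIT);
}
```

With 8192 inodes per group, a group whose descriptor reports 8000 unused inodes would need only 192 inodes scanned in this model, which is where the claimed e2fsck savings come from on mostly empty filesystems.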
#include <linux/crc16.h>
#include <linux/dax.h>
#include <linux/cleancache.h>
#include <linux/uaccess.h>
ext4: add support for lazy inode table initialization

When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out. The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, have not been corrupted. However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the old, uninitialized data could potentially cause e2fsck to
report false problems.

Hence, it is important for the inode tables to be initialized as soon
as possible. This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.

This is done via a new kernel thread called ext4lazyinit, which is
created on demand and destroyed when it is no longer needed. There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with the init_itable mount option is mounted, the
ext4lazyinit thread is created, and the filesystem can then register
its request in the request list.

This thread then walks through the list of requests, picking up
scheduled requests and invoking ext4_init_inode_table(). The next schedule
time for the request is computed by multiplying the time it took to
zero out the last inode table by a wait multiplier, which can be set with
the (init_itable=n) mount option (default is 10). We are doing
this so we do not take the whole I/O bandwidth. When the thread is no
longer necessary (request list is empty) it frees the appropriate
structures and exits (and can be created later by another
filesystem).

We do not disturb regular inode allocations in any way; the allocator
simply does not care whether the inode table is zeroed or not. But when
zeroing, we have to skip used inodes, obviously. We should also prevent
new inode allocations from the group while zeroing is under way. For that we
take the write alloc_sem lock in ext4_init_inode_table() and the read alloc_sem
in ext4_claim_inode(), so when we are unlucky and the allocator hits the
group which is currently being zeroed, it just has to wait.

This can be suppressed using the mount option no_init_itable.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
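
The throttling rule described in the message above (wait for a multiple of the time the last zeroing pass took) can be sketched as plain arithmetic. This is a hedged illustration, not the kernel code: `demo_next_schedule()` is a hypothetical helper, time is modelled as an abstract tick counter rather than jiffies, and only the default multiplier of 10 comes from the message itself.

```c
#include <assert.h>

/* Default wait multiplier, as stated in the commit message. */
#define DEMO_DEFAULT_INIT_ITABLE_MULT 10UL

/*
 * Given the current time, the time spent zeroing the last inode table,
 * and the wait multiplier, return when the request should run next.
 * A longer zeroing pass therefore produces a proportionally longer pause.
 */
static unsigned long demo_next_schedule(unsigned long now,
					unsigned long elapsed,
					unsigned long multiplier)
{
	return now + elapsed * multiplier;
}
```

For example, if zeroing one group's inode table took 25 ticks and finished at tick 1000, the request would next be eligible at tick 1250 with the default multiplier, so in this model the thread spends roughly one part in eleven of its time doing I/O.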
#include <linux/kthread.h>
#include <linux/freezer.h>

#include "ext4.h"
#include "ext4_extents.h"	/* Needed for trace points definition */
#include "ext4_jbd2.h"
#include "xattr.h"
#include "acl.h"
#include "mballoc.h"
#include "fsmap.h"

#define CREATE_TRACE_POINTS
#include <trace/events/ext4.h>

static struct ext4_lazy_init *ext4_li_info;
static struct mutex ext4_li_mtx;
static struct ratelimit_state ext4_mount_msg_ratelimit;

static int ext4_load_journal(struct super_block *, struct ext4_super_block *,
			     unsigned long journal_devnum);
static int ext4_show_options(struct seq_file *seq, struct dentry *root);
static int ext4_commit_super(struct super_block *sb, int sync);
static void ext4_mark_recovery_complete(struct super_block *sb,
					struct ext4_super_block *es);
static void ext4_clear_journal_err(struct super_block *sb,
				   struct ext4_super_block *es);
static int ext4_sync_fs(struct super_block *sb, int wait);
static int ext4_remount(struct super_block *sb, int *flags, char *data);
static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf);
static int ext4_unfreeze(struct super_block *sb);
static int ext4_freeze(struct super_block *sb);
static struct dentry *ext4_mount(struct file_system_type *fs_type, int flags,
				 const char *dev_name, void *data);
static inline int ext2_feature_set_ok(struct super_block *sb);
static inline int ext3_feature_set_ok(struct super_block *sb);
static int ext4_feature_set_ok(struct super_block *sb, int readonly);
static void ext4_destroy_lazyinit_thread(void);
static void ext4_unregister_li_request(struct super_block *sb);
static void ext4_clear_request_list(void);
static struct inode *ext4_get_journal_inode(struct super_block *sb,
					    unsigned int journal_inum);

/*
 * Lock ordering
 *
 * Note the difference between i_mmap_sem (EXT4_I(inode)->i_mmap_sem) and
 * i_mmap_rwsem (inode->i_mmap_rwsem)!
 *
 * page fault path:
 * mmap_sem -> sb_start_pagefault -> i_mmap_sem (r) -> transaction start ->
 *   page lock -> i_data_sem (rw)
 *
 * buffered write path:
 * sb_start_write -> i_mutex -> mmap_sem
 * sb_start_write -> i_mutex -> transaction start -> page lock ->
 *   i_data_sem (rw)
 *
 * truncate:
 * sb_start_write -> i_mutex -> EXT4_STATE_DIOREAD_LOCK (w) -> i_mmap_sem (w) ->
 *   i_mmap_rwsem (w) -> page lock
 * sb_start_write -> i_mutex -> EXT4_STATE_DIOREAD_LOCK (w) -> i_mmap_sem (w) ->
 *   transaction start -> i_data_sem (rw)
 *
 * direct IO:
 * sb_start_write -> i_mutex -> EXT4_STATE_DIOREAD_LOCK (r) -> mmap_sem
 * sb_start_write -> i_mutex -> EXT4_STATE_DIOREAD_LOCK (r) ->
 *   transaction start -> i_data_sem (rw)
 *
 * writepages:
 * transaction start -> page lock(s) -> i_data_sem (rw)
 */

#if !defined(CONFIG_EXT2_FS) && !defined(CONFIG_EXT2_FS_MODULE) && defined(CONFIG_EXT4_USE_FOR_EXT2)
static struct file_system_type ext2_fs_type = {
	.owner		= THIS_MODULE,
	.name		= "ext2",
	.mount		= ext4_mount,
	.kill_sb	= kill_block_super,
	.fs_flags	= FS_REQUIRES_DEV,
};
MODULE_ALIAS_FS("ext2");
MODULE_ALIAS("ext2");
#define IS_EXT2_SB(sb) ((sb)->s_bdev->bd_holder == &ext2_fs_type)
#else
#define IS_EXT2_SB(sb) (0)
#endif

static struct file_system_type ext3_fs_type = {
	.owner		= THIS_MODULE,
	.name		= "ext3",
	.mount		= ext4_mount,
	.kill_sb	= kill_block_super,
	.fs_flags	= FS_REQUIRES_DEV,
};
MODULE_ALIAS_FS("ext3");
MODULE_ALIAS("ext3");
#define IS_EXT3_SB(sb) ((sb)->s_bdev->bd_holder == &ext3_fs_type)

static int ext4_verify_csum_type(struct super_block *sb,
				 struct ext4_super_block *es)
{
	if (!ext4_has_feature_metadata_csum(sb))
		return 1;

	return es->s_checksum_type == EXT4_CRC32C_CHKSUM;
}

static __le32 ext4_superblock_csum(struct super_block *sb,
				   struct ext4_super_block *es)
{
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	int offset = offsetof(struct ext4_super_block, s_checksum);
	__u32 csum;

	csum = ext4_chksum(sbi, ~0, (char *)es, offset);

	return cpu_to_le32(csum);
}

static int ext4_superblock_csum_verify(struct super_block *sb,
				       struct ext4_super_block *es)
{
	if (!ext4_has_metadata_csum(sb))
		return 1;

	return es->s_checksum == ext4_superblock_csum(sb, es);
}

void ext4_superblock_csum_set(struct super_block *sb)
{
	struct ext4_super_block *es = EXT4_SB(sb)->s_es;

	if (!ext4_has_metadata_csum(sb))
		return;

	es->s_checksum = ext4_superblock_csum(sb, es);
}

void *ext4_kvmalloc(size_t size, gfp_t flags)
{
	void *ret;

	ret = kmalloc(size, flags | __GFP_NOWARN);
	if (!ret)
		ret = __vmalloc(size, flags, PAGE_KERNEL);
	return ret;
}

void *ext4_kvzalloc(size_t size, gfp_t flags)
{
	void *ret;

	ret = kzalloc(size, flags | __GFP_NOWARN);
	if (!ret)
		ret = __vmalloc(size, flags | __GFP_ZERO, PAGE_KERNEL);
	return ret;
}

ext4_fsblk_t ext4_block_bitmap(struct super_block *sb,
			       struct ext4_group_desc *bg)
{
	return le32_to_cpu(bg->bg_block_bitmap_lo) |
		(EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT ?
		 (ext4_fsblk_t)le32_to_cpu(bg->bg_block_bitmap_hi) << 32 : 0);
}

ext4_fsblk_t ext4_inode_bitmap(struct super_block *sb,
			       struct ext4_group_desc *bg)
{
	return le32_to_cpu(bg->bg_inode_bitmap_lo) |
		(EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT ?
		 (ext4_fsblk_t)le32_to_cpu(bg->bg_inode_bitmap_hi) << 32 : 0);
}

ext4_fsblk_t ext4_inode_table(struct super_block *sb,
			      struct ext4_group_desc *bg)
{
	return le32_to_cpu(bg->bg_inode_table_lo) |
		(EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT ?
		 (ext4_fsblk_t)le32_to_cpu(bg->bg_inode_table_hi) << 32 : 0);
}

__u32 ext4_free_group_clusters(struct super_block *sb,
			       struct ext4_group_desc *bg)
{
	return le16_to_cpu(bg->bg_free_blocks_count_lo) |
		(EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT ?
		 (__u32)le16_to_cpu(bg->bg_free_blocks_count_hi) << 16 : 0);
}

__u32 ext4_free_inodes_count(struct super_block *sb,
			     struct ext4_group_desc *bg)
{
	return le16_to_cpu(bg->bg_free_inodes_count_lo) |
		(EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT ?
		 (__u32)le16_to_cpu(bg->bg_free_inodes_count_hi) << 16 : 0);
}

__u32 ext4_used_dirs_count(struct super_block *sb,
			   struct ext4_group_desc *bg)
{
	return le16_to_cpu(bg->bg_used_dirs_count_lo) |
		(EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT ?
		 (__u32)le16_to_cpu(bg->bg_used_dirs_count_hi) << 16 : 0);
}

__u32 ext4_itable_unused_count(struct super_block *sb,
			       struct ext4_group_desc *bg)
{
	return le16_to_cpu(bg->bg_itable_unused_lo) |
		(EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT ?
		 (__u32)le16_to_cpu(bg->bg_itable_unused_hi) << 16 : 0);
}

void ext4_block_bitmap_set(struct super_block *sb,
			   struct ext4_group_desc *bg, ext4_fsblk_t blk)
{
	bg->bg_block_bitmap_lo = cpu_to_le32((u32)blk);
	if (EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT)
		bg->bg_block_bitmap_hi = cpu_to_le32(blk >> 32);
}

void ext4_inode_bitmap_set(struct super_block *sb,
			   struct ext4_group_desc *bg, ext4_fsblk_t blk)
{
	bg->bg_inode_bitmap_lo = cpu_to_le32((u32)blk);
	if (EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT)
		bg->bg_inode_bitmap_hi = cpu_to_le32(blk >> 32);
}

void ext4_inode_table_set(struct super_block *sb,
			  struct ext4_group_desc *bg, ext4_fsblk_t blk)
{
	bg->bg_inode_table_lo = cpu_to_le32((u32)blk);
	if (EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT)
		bg->bg_inode_table_hi = cpu_to_le32(blk >> 32);
}

void ext4_free_group_clusters_set(struct super_block *sb,
				  struct ext4_group_desc *bg, __u32 count)
{
	bg->bg_free_blocks_count_lo = cpu_to_le16((__u16)count);
	if (EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT)
		bg->bg_free_blocks_count_hi = cpu_to_le16(count >> 16);
}

void ext4_free_inodes_set(struct super_block *sb,
			  struct ext4_group_desc *bg, __u32 count)
{
	bg->bg_free_inodes_count_lo = cpu_to_le16((__u16)count);
	if (EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT)
		bg->bg_free_inodes_count_hi = cpu_to_le16(count >> 16);
}

void ext4_used_dirs_set(struct super_block *sb,
			struct ext4_group_desc *bg, __u32 count)
{
	bg->bg_used_dirs_count_lo = cpu_to_le16((__u16)count);
	if (EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT)
		bg->bg_used_dirs_count_hi = cpu_to_le16(count >> 16);
}

void ext4_itable_unused_set(struct super_block *sb,
			    struct ext4_group_desc *bg, __u32 count)
{
	bg->bg_itable_unused_lo = cpu_to_le16((__u16)count);
	if (EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT)
		bg->bg_itable_unused_hi = cpu_to_le16(count >> 16);
}

static void __save_error_info(struct super_block *sb, const char *func,
			      unsigned int line)
{
	struct ext4_super_block *es = EXT4_SB(sb)->s_es;

	EXT4_SB(sb)->s_mount_state |= EXT4_ERROR_FS;
	if (bdev_read_only(sb->s_bdev))
		return;
	es->s_state |= cpu_to_le16(EXT4_ERROR_FS);
	es->s_last_error_time = cpu_to_le32(get_seconds());
	strncpy(es->s_last_error_func, func, sizeof(es->s_last_error_func));
	es->s_last_error_line = cpu_to_le32(line);
	if (!es->s_first_error_time) {
		es->s_first_error_time = es->s_last_error_time;
		strncpy(es->s_first_error_func, func,
			sizeof(es->s_first_error_func));
		es->s_first_error_line = cpu_to_le32(line);
		es->s_first_error_ino = es->s_last_error_ino;
		es->s_first_error_block = es->s_last_error_block;
	}
	/*
	 * Start the daily error reporting function if it hasn't been
	 * started already
	 */
	if (!es->s_error_count)
		mod_timer(&EXT4_SB(sb)->s_err_report, jiffies + 24*60*60*HZ);
	le32_add_cpu(&es->s_error_count, 1);
}

static void save_error_info(struct super_block *sb, const char *func,
			    unsigned int line)
{
	__save_error_info(sb, func, line);
	ext4_commit_super(sb, 1);
}

/*
 * The del_gendisk() function uninitializes the disk-specific data
 * structures, including the bdi structure, without telling anyone
 * else.  Once this happens, any attempt to call mark_buffer_dirty()
 * (for example, by ext4_commit_super), will cause a kernel OOPS.
 * This is a kludge to prevent these oops until we can put in a proper
 * hook in del_gendisk() to inform the VFS and file system layers.
 */
static int block_device_ejected(struct super_block *sb)
{
	struct inode *bd_inode = sb->s_bdev->bd_inode;
	struct backing_dev_info *bdi = inode_to_bdi(bd_inode);

	return bdi->dev == NULL;
}

static void ext4_journal_commit_callback(journal_t *journal, transaction_t *txn)
{
	struct super_block		*sb = journal->j_private;
	struct ext4_sb_info		*sbi = EXT4_SB(sb);
	int				error = is_journal_aborted(journal);
	struct ext4_journal_cb_entry	*jce;

	BUG_ON(txn->t_state == T_FINISHED);

	ext4_process_freed_data(sb, txn->t_tid);

	spin_lock(&sbi->s_md_lock);
	while (!list_empty(&txn->t_private_list)) {
		jce = list_entry(txn->t_private_list.next,
				 struct ext4_journal_cb_entry, jce_list);
		list_del_init(&jce->jce_list);
		spin_unlock(&sbi->s_md_lock);
		jce->jce_func(sb, jce, error);
		spin_lock(&sbi->s_md_lock);
	}
	spin_unlock(&sbi->s_md_lock);
}

/* Deal with the reporting of failure conditions on a filesystem such as
 * inconsistencies detected or read IO failures.
 *
 * On ext2, we can store the error state of the filesystem in the
 * superblock.  That is not possible on ext4, because we may have other
 * write ordering constraints on the superblock which prevent us from
 * writing it out straight away; and given that the journal is about to
 * be aborted, we can't rely on the current, or future, transactions to
 * write out the superblock safely.
 *
 * We'll just use the jbd2_journal_abort() error code to record an error in
 * the journal instead.  On recovery, the journal will complain about
 * that error until we've noted it down and cleared it.
 */

static void ext4_handle_error(struct super_block *sb)
{
	if (sb->s_flags & MS_RDONLY)
		return;

	if (!test_opt(sb, ERRORS_CONT)) {
		journal_t *journal = EXT4_SB(sb)->s_journal;

		EXT4_SB(sb)->s_mount_flags |= EXT4_MF_FS_ABORTED;
		if (journal)
			jbd2_journal_abort(journal, -EIO);
	}
	if (test_opt(sb, ERRORS_RO)) {
		ext4_msg(sb, KERN_CRIT, "Remounting filesystem read-only");
		/*
		 * Make sure updated value of ->s_mount_flags will be visible
		 * before ->s_flags update
		 */
		smp_wmb();
		sb->s_flags |= MS_RDONLY;
	}
	if (test_opt(sb, ERRORS_PANIC)) {
		if (EXT4_SB(sb)->s_journal &&
		    !(EXT4_SB(sb)->s_journal->j_flags & JBD2_REC_ERR))
			return;
		panic("EXT4-fs (device %s): panic forced after error\n",
			sb->s_id);
	}
}

#define ext4_error_ratelimit(sb)					\
		___ratelimit(&(EXT4_SB(sb)->s_err_ratelimit_state),	\
			     "EXT4-fs error")

void __ext4_error(struct super_block *sb, const char *function,
		  unsigned int line, const char *fmt, ...)
{
	struct va_format vaf;
	va_list args;

	if (unlikely(ext4_forced_shutdown(EXT4_SB(sb))))
		return;

	if (ext4_error_ratelimit(sb)) {
		va_start(args, fmt);
		vaf.fmt = fmt;
		vaf.va = &args;
		printk(KERN_CRIT
		       "EXT4-fs error (device %s): %s:%d: comm %s: %pV\n",
		       sb->s_id, function, line, current->comm, &vaf);
		va_end(args);
	}
	save_error_info(sb, function, line);
	ext4_handle_error(sb);
}

void __ext4_error_inode(struct inode *inode, const char *function,
			unsigned int line, ext4_fsblk_t block,
			const char *fmt, ...)
{
	va_list args;
	struct va_format vaf;
	struct ext4_super_block *es = EXT4_SB(inode->i_sb)->s_es;

	if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
		return;

	es->s_last_error_ino = cpu_to_le32(inode->i_ino);
	es->s_last_error_block = cpu_to_le64(block);
	if (ext4_error_ratelimit(inode->i_sb)) {
		va_start(args, fmt);
		vaf.fmt = fmt;
		vaf.va = &args;
		if (block)
			printk(KERN_CRIT "EXT4-fs error (device %s): %s:%d: "
			       "inode #%lu: block %llu: comm %s: %pV\n",
			       inode->i_sb->s_id, function, line, inode->i_ino,
			       block, current->comm, &vaf);
		else
			printk(KERN_CRIT "EXT4-fs error (device %s): %s:%d: "
			       "inode #%lu: comm %s: %pV\n",
			       inode->i_sb->s_id, function, line, inode->i_ino,
			       current->comm, &vaf);
		va_end(args);
	}
	save_error_info(inode->i_sb, function, line);
	ext4_handle_error(inode->i_sb);
}

void __ext4_error_file(struct file *file, const char *function,
		       unsigned int line, ext4_fsblk_t block,
		       const char *fmt, ...)
{
	va_list args;
	struct va_format vaf;
	struct ext4_super_block *es;
	struct inode *inode = file_inode(file);
	char pathname[80], *path;

	if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
		return;

	es = EXT4_SB(inode->i_sb)->s_es;
	es->s_last_error_ino = cpu_to_le32(inode->i_ino);
	if (ext4_error_ratelimit(inode->i_sb)) {
		path = file_path(file, pathname, sizeof(pathname));
		if (IS_ERR(path))
			path = "(unknown)";
		va_start(args, fmt);
		vaf.fmt = fmt;
		vaf.va = &args;
		if (block)
			printk(KERN_CRIT
			       "EXT4-fs error (device %s): %s:%d: inode #%lu: "
			       "block %llu: comm %s: path %s: %pV\n",
			       inode->i_sb->s_id, function, line, inode->i_ino,
			       block, current->comm, path, &vaf);
		else
			printk(KERN_CRIT
			       "EXT4-fs error (device %s): %s:%d: inode #%lu: "
			       "comm %s: path %s: %pV\n",
			       inode->i_sb->s_id, function, line, inode->i_ino,
			       current->comm, path, &vaf);
		va_end(args);
	}
	save_error_info(inode->i_sb, function, line);
	ext4_handle_error(inode->i_sb);
}

const char *ext4_decode_error(struct super_block *sb, int errno,
			      char nbuf[16])
{
	char *errstr = NULL;

	switch (errno) {
	case -EFSCORRUPTED:
		errstr = "Corrupt filesystem";
		break;
	case -EFSBADCRC:
		errstr = "Filesystem failed CRC";
		break;
	case -EIO:
		errstr = "IO failure";
		break;
	case -ENOMEM:
		errstr = "Out of memory";
		break;
	case -EROFS:
		if (!sb || (EXT4_SB(sb)->s_journal &&
			    EXT4_SB(sb)->s_journal->j_flags & JBD2_ABORT))
			errstr = "Journal has aborted";
		else
			errstr = "Readonly filesystem";
		break;
	default:
		/* If the caller passed in an extra buffer for unknown
		 * errors, textualise them now.  Else we just return
		 * NULL. */
		if (nbuf) {
			/* Check for truncated error codes... */
			if (snprintf(nbuf, 16, "error %d", -errno) >= 0)
				errstr = nbuf;
		}
		break;
	}

	return errstr;
}

/* __ext4_std_error decodes expected errors from journaling functions
 * automatically and invokes the appropriate error response.  */

void __ext4_std_error(struct super_block *sb, const char *function,
		      unsigned int line, int errno)
{
	char nbuf[16];
	const char *errstr;

	if (unlikely(ext4_forced_shutdown(EXT4_SB(sb))))
		return;

	/* Special case: if the error is EROFS, and we're not already
	 * inside a transaction, then there's really no point in logging
	 * an error. */
	if (errno == -EROFS && journal_current_handle() == NULL &&
	    (sb->s_flags & MS_RDONLY))
		return;

	if (ext4_error_ratelimit(sb)) {
		errstr = ext4_decode_error(sb, errno, nbuf);
		printk(KERN_CRIT "EXT4-fs error (device %s) in %s:%d: %s\n",
		       sb->s_id, function, line, errstr);
	}

	save_error_info(sb, function, line);
	ext4_handle_error(sb);
}

/*
2006-10-11 12:20:53 +04:00
* ext4_abort is a much stronger failure handler than ext4_error . The
2006-10-11 12:20:50 +04:00
* abort function may be used to deal with unrecoverable failures such
* as journal IO errors or ENOMEM at a critical moment in log management .
*
* We unconditionally force the filesystem into an ABORT | READONLY state ,
* unless the error response on the fs has been set to panic in which
* case we take the easy way out and panic immediately .
*/
2010-06-29 19:07:07 +04:00
void __ext4_abort ( struct super_block * sb , const char * function ,
2010-07-27 19:56:40 +04:00
unsigned int line , const char * fmt , . . . )
2006-10-11 12:20:50 +04:00
{
2016-10-13 06:12:53 +03:00
struct va_format vaf ;
2006-10-11 12:20:50 +04:00
va_list args ;
2017-02-05 09:28:48 +03:00
if ( unlikely ( ext4_forced_shutdown ( EXT4_SB ( sb ) ) ) )
return ;
2010-07-27 19:56:03 +04:00
save_error_info ( sb , function , line ) ;
2006-10-11 12:20:50 +04:00
va_start ( args , fmt ) ;
2016-10-13 06:12:53 +03:00
vaf . fmt = fmt ;
vaf . va = & args ;
printk ( KERN_CRIT " EXT4-fs error (device %s): %s:%d: %pV \n " ,
sb - > s_id , function , line , & vaf ) ;
2006-10-11 12:20:50 +04:00
va_end ( args ) ;
2010-07-27 19:56:03 +04:00
if ( ( sb - > s_flags & MS_RDONLY ) = = 0 ) {
ext4_msg ( sb , KERN_CRIT , " Remounting filesystem read-only " ) ;
EXT4_SB ( sb ) - > s_mount_flags | = EXT4_MF_FS_ABORTED ;
2013-06-13 06:38:04 +04:00
/*
* Make sure updated value of - > s_mount_flags will be visible
* before - > s_flags update
*/
smp_wmb ( ) ;
sb - > s_flags | = MS_RDONLY ;
2010-07-27 19:56:03 +04:00
if ( EXT4_SB ( sb ) - > s_journal )
jbd2_journal_abort ( EXT4_SB ( sb ) - > s_journal , - EIO ) ;
save_error_info ( sb , function , line ) ;
}
2015-10-19 00:02:56 +03:00
if ( test_opt ( sb , ERRORS_PANIC ) ) {
if ( EXT4_SB ( sb ) - > s_journal & &
! ( EXT4_SB ( sb ) - > s_journal - > j_flags & JBD2_REC_ERR ) )
return ;
2006-10-11 12:20:53 +04:00
panic ( " EXT4-fs panic from previous error \n " ) ;
2015-10-19 00:02:56 +03:00
}
2006-10-11 12:20:50 +04:00
}
2013-07-01 16:12:37 +04:00
void __ext4_msg ( struct super_block * sb ,
const char * prefix , const char * fmt , . . . )
2009-06-05 01:36:36 +04:00
{
2010-12-20 06:43:19 +03:00
struct va_format vaf ;
2009-06-05 01:36:36 +04:00
va_list args ;
2013-10-18 05:11:01 +04:00
if ( ! ___ratelimit ( & ( EXT4_SB ( sb ) - > s_msg_ratelimit_state ) , " EXT4-fs " ) )
return ;
2009-06-05 01:36:36 +04:00
va_start ( args , fmt ) ;
2010-12-20 06:43:19 +03:00
vaf . fmt = fmt ;
vaf . va = & args ;
printk ( " %sEXT4-fs (%s): %pV \n " , prefix , sb - > s_id , & vaf ) ;
2009-06-05 01:36:36 +04:00
va_end ( args ) ;
}
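The message helpers in this file all follow the same shape: a fixed prefix carrying the device name, followed by the caller's format string expanded from a `va_list`. The kernel does this in one `printk` call through `struct va_format` and `%pV`. The sketch below is a userspace approximation only: `format_msg` is a hypothetical name, and it uses `snprintf` plus `vsnprintf` into a buffer because `%pV` is not available outside the kernel.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write "<prefix>EXT4-fs (<dev>): <message>" into out.
 * Returns the number of characters written, or -1 if it does not fit.
 */
static int format_msg(char *out, size_t len, const char *prefix,
		      const char *dev, const char *fmt, ...)
{
	va_list args;
	int n, m;

	/* Fixed part: log-level prefix and device name. */
	n = snprintf(out, len, "%sEXT4-fs (%s): ", prefix, dev);
	if (n < 0 || (size_t)n >= len)
		return -1;

	/* Variable part: the caller's format string and arguments. */
	va_start(args, fmt);
	m = vsnprintf(out + n, len - (size_t)n, fmt, args);
	va_end(args);
	if (m < 0 || (size_t)m >= len - (size_t)n)
		return -1;

	return n + m;
}
```

Splitting the fixed and variable parts this way keeps the variadic handling in one place, which is the same reason the kernel helpers funnel everything through a single `va_format`.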
2015-06-15 21:50:26 +03:00
# define ext4_warning_ratelimit(sb) \
___ratelimit ( & ( EXT4_SB ( sb ) - > s_warning_ratelimit_state ) , \
" EXT4-fs warning " )
2010-02-15 22:19:27 +03:00
void __ext4_warning ( struct super_block * sb , const char * function ,
2010-07-27 19:56:40 +04:00
unsigned int line , const char * fmt , . . . )
2006-10-11 12:20:50 +04:00
{
2010-12-20 06:43:19 +03:00
struct va_format vaf ;
2006-10-11 12:20:50 +04:00
va_list args ;
2015-06-15 21:50:26 +03:00
if ( ! ext4_warning_ratelimit ( sb ) )
2013-10-18 05:11:01 +04:00
return ;
2006-10-11 12:20:50 +04:00
va_start ( args , fmt ) ;
2010-12-20 06:43:19 +03:00
vaf . fmt = fmt ;
vaf . va = & args ;
printk ( KERN_WARNING " EXT4-fs warning (device %s): %s:%d: %pV \n " ,
sb - > s_id , function , line , & vaf ) ;
2006-10-11 12:20:50 +04:00
va_end ( args ) ;
}
2015-06-15 21:50:26 +03:00
void __ext4_warning_inode(const struct inode *inode, const char *function,
			  unsigned int line, const char *fmt, ...)
{
	struct va_format vaf;
	va_list args;

	if (!ext4_warning_ratelimit(inode->i_sb))
		return;

	va_start(args, fmt);
	vaf.fmt = fmt;
	vaf.va = &args;
	printk(KERN_WARNING "EXT4-fs warning (device %s): %s:%d: "
	       "inode #%lu: comm %s: %pV\n", inode->i_sb->s_id,
	       function, line, inode->i_ino, current->comm, &vaf);
	va_end(args);
}
2010-06-29 20:54:28 +04:00
void __ext4_grp_locked_error ( const char * function , unsigned int line ,
struct super_block * sb , ext4_group_t grp ,
unsigned long ino , ext4_fsblk_t block ,
const char * fmt , . . . )
2009-01-06 06:19:52 +03:00
__releases ( bitlock )
__acquires ( bitlock )
{
2010-12-20 06:43:19 +03:00
struct va_format vaf ;
2009-01-06 06:19:52 +03:00
va_list args ;
struct ext4_super_block * es = EXT4_SB ( sb ) - > s_es ;
2017-02-05 09:28:48 +03:00
if ( unlikely ( ext4_forced_shutdown ( EXT4_SB ( sb ) ) ) )
return ;
2010-07-27 19:56:03 +04:00
es - > s_last_error_ino = cpu_to_le32 ( ino ) ;
es - > s_last_error_block = cpu_to_le64 ( block ) ;
__save_error_info ( sb , function , line ) ;
2010-12-20 06:43:19 +03:00
2013-10-18 05:11:01 +04:00
if ( ext4_error_ratelimit ( sb ) ) {
va_start ( args , fmt ) ;
vaf . fmt = fmt ;
vaf . va = & args ;
printk ( KERN_CRIT " EXT4-fs error (device %s): %s:%d: group %u, " ,
sb - > s_id , function , line , grp ) ;
if ( ino )
printk ( KERN_CONT " inode %lu: " , ino ) ;
if ( block )
printk ( KERN_CONT " block %llu: " ,
( unsigned long long ) block ) ;
printk ( KERN_CONT " %pV \n " , & vaf ) ;
va_end ( args ) ;
}
2009-01-06 06:19:52 +03:00
if ( test_opt ( sb , ERRORS_CONT ) ) {
2009-05-01 08:33:44 +04:00
ext4_commit_super ( sb , 0 ) ;
2009-01-06 06:19:52 +03:00
return ;
}
2010-07-27 19:56:03 +04:00
2009-01-06 06:19:52 +03:00
ext4_unlock_group ( sb , grp ) ;
ext4_handle_error ( sb ) ;
/*
* We only get here in the ERRORS_RO case ; relocking the group
* may be dangerous , but nothing bad will happen since the
* filesystem will have already been marked read / only and the
* journal has been aborted . We return 1 as a hint to callers
* who might want to use the return value from
2011-03-31 05:57:33 +04:00
* ext4_grp_locked_error ( ) to distinguish between the
2009-01-06 06:19:52 +03:00
* ERRORS_CONT and ERRORS_RO case , and perhaps return more
* aggressively from the ext4 function in question , with a
* more appropriate error code .
*/
ext4_lock_group ( sb , grp ) ;
return ;
}
2006-10-11 12:20:53 +04:00
void ext4_update_dynamic_rev ( struct super_block * sb )
2006-10-11 12:20:50 +04:00
{
2006-10-11 12:20:53 +04:00
struct ext4_super_block * es = EXT4_SB ( sb ) - > s_es ;
2006-10-11 12:20:50 +04:00
2006-10-11 12:20:53 +04:00
if ( le32_to_cpu ( es - > s_rev_level ) > EXT4_GOOD_OLD_REV )
2006-10-11 12:20:50 +04:00
return ;
2010-02-15 22:19:27 +03:00
ext4_warning ( sb ,
2006-10-11 12:20:50 +04:00
" updating to rev %d because of new feature flag, "
" running e2fsck is recommended " ,
2006-10-11 12:20:53 +04:00
EXT4_DYNAMIC_REV ) ;
2006-10-11 12:20:50 +04:00
2006-10-11 12:20:53 +04:00
es - > s_first_ino = cpu_to_le32 ( EXT4_GOOD_OLD_FIRST_INO ) ;
es - > s_inode_size = cpu_to_le16 ( EXT4_GOOD_OLD_INODE_SIZE ) ;
es - > s_rev_level = cpu_to_le32 ( EXT4_DYNAMIC_REV ) ;
2006-10-11 12:20:50 +04:00
/* leave es->s_feature_*compat flags alone */
/* es->s_uuid will be set by e2fsck if empty */
/*
* The rest of the superblock fields should be zero , and if not it
* means they are likely already in use , so leave them alone . We
* can leave it up to e2fsck to clean up any inconsistencies there .
*/
}
/*
* Open the external journal device
*/
2009-06-05 01:36:36 +04:00
static struct block_device * ext4_blkdev_get ( dev_t dev , struct super_block * sb )
2006-10-11 12:20:50 +04:00
{
struct block_device * bdev ;
char b [ BDEVNAME_SIZE ] ;
2010-11-13 13:55:18 +03:00
bdev = blkdev_get_by_dev ( dev , FMODE_READ | FMODE_WRITE | FMODE_EXCL , sb ) ;
2006-10-11 12:20:50 +04:00
if ( IS_ERR ( bdev ) )
goto fail ;
return bdev ;
fail :
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " failed to open journal device %s: %ld " ,
2006-10-11 12:20:50 +04:00
__bdevname ( dev , b ) , PTR_ERR ( bdev ) ) ;
return NULL ;
}
/*
* Release the journal device
*/
2013-05-06 06:11:03 +04:00
static void ext4_blkdev_put ( struct block_device * bdev )
2006-10-11 12:20:50 +04:00
{
2013-05-06 06:11:03 +04:00
blkdev_put ( bdev , FMODE_READ | FMODE_WRITE | FMODE_EXCL ) ;
2006-10-11 12:20:50 +04:00
}
2013-05-06 06:11:03 +04:00
static void ext4_blkdev_remove ( struct ext4_sb_info * sbi )
2006-10-11 12:20:50 +04:00
{
struct block_device * bdev ;
bdev = sbi - > journal_bdev ;
if ( bdev ) {
2013-05-06 06:11:03 +04:00
ext4_blkdev_put ( bdev ) ;
2006-10-11 12:20:50 +04:00
sbi - > journal_bdev = NULL ;
}
}
static inline struct inode * orphan_list_entry ( struct list_head * l )
{
2006-10-11 12:20:53 +04:00
return & list_entry ( l , struct ext4_inode_info , i_orphan ) - > vfs_inode ;
2006-10-11 12:20:50 +04:00
}
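orphan_list_entry() relies on the embedded-list idiom: the list node lives inside the per-inode structure, so a pointer to the node can be converted back to the enclosing structure by subtracting the member offset. The sketch below shows that arithmetic in plain userspace C. All names in it (`list_node`, `container_of_sketch`, `demo_inode`, `demo_inode_info`, `orphan_entry`) are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct list_head. */
struct list_node {
	struct list_node *next, *prev;
};

/* Simplified container_of(): step back from a member to its container. */
#define container_of_sketch(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Stand-ins for struct inode and struct ext4_inode_info. */
struct demo_inode {
	unsigned long ino;
};

struct demo_inode_info {
	int flags;
	struct list_node orphan;	/* plays the role of i_orphan */
	struct demo_inode vfs_inode;	/* embedded generic inode */
};

/* Map a list node back to the generic inode embedded next to it. */
static struct demo_inode *orphan_entry(struct list_node *l)
{
	return &container_of_sketch(l, struct demo_inode_info, orphan)->vfs_inode;
}
```

Because both the list node and the generic inode are embedded in the same structure, no extra pointer has to be stored to get from one to the other.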
2006-10-11 12:20:53 +04:00
static void dump_orphan_list ( struct super_block * sb , struct ext4_sb_info * sbi )
2006-10-11 12:20:50 +04:00
{
struct list_head * l ;
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " sb orphan head is %d " ,
le32_to_cpu ( sbi - > s_es - > s_last_orphan ) ) ;
2006-10-11 12:20:50 +04:00
printk ( KERN_ERR " sb_info orphan list: \n " ) ;
list_for_each ( l , & sbi - > s_orphan ) {
struct inode * inode = orphan_list_entry ( l ) ;
printk ( KERN_ERR " "
" inode %s:%lu at %p: mode %o, nlink %d, next %d \n " ,
inode - > i_sb - > s_id , inode - > i_ino , inode ,
inode - > i_mode , inode - > i_nlink ,
NEXT_ORPHAN ( inode ) ) ;
}
}
2017-04-06 16:40:06 +03:00
# ifdef CONFIG_QUOTA
static int ext4_quota_off ( struct super_block * sb , int type ) ;
static inline void ext4_quota_off_umount ( struct super_block * sb )
{
int type ;
2017-05-22 05:31:23 +03:00
/* Use our quota_off function to clear inode flags etc. */
for ( type = 0 ; type < EXT4_MAXQUOTAS ; type + + )
ext4_quota_off ( sb , type ) ;
2017-04-06 16:40:06 +03:00
}
# else
static inline void ext4_quota_off_umount ( struct super_block * sb )
{
}
# endif
2008-07-27 00:15:44 +04:00
static void ext4_put_super ( struct super_block * sb )
2006-10-11 12:20:50 +04:00
{
2006-10-11 12:20:53 +04:00
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
struct ext4_super_block * es = sbi - > s_es ;
2017-02-05 07:38:06 +03:00
int aborted = 0 ;
2008-10-28 05:53:05 +03:00
int i , err ;
2006-10-11 12:20:50 +04:00
2010-10-28 05:30:05 +04:00
ext4_unregister_li_request ( sb ) ;
2017-04-06 16:40:06 +03:00
ext4_quota_off_umount ( sb ) ;
2010-05-19 15:16:42 +04:00
2013-06-04 22:21:02 +04:00
flush_workqueue ( sbi - > rsv_conversion_wq ) ;
destroy_workqueue ( sbi - > rsv_conversion_wq ) ;
2009-09-28 23:48:41 +04:00
2009-01-07 08:06:22 +03:00
if ( sbi - > s_journal ) {
2017-02-05 07:38:06 +03:00
aborted = is_journal_aborted ( sbi - > s_journal ) ;
2009-01-07 08:06:22 +03:00
err = jbd2_journal_destroy ( sbi - > s_journal ) ;
sbi - > s_journal = NULL ;
2017-02-05 07:38:06 +03:00
if ( ( err < 0 ) & & ! aborted )
2010-06-29 19:07:07 +04:00
ext4_abort ( sb , " Couldn't clean up the journal " ) ;
2009-01-07 08:06:22 +03:00
}
2009-12-09 05:48:58 +03:00
2015-09-23 19:46:17 +03:00
ext4_unregister_sysfs ( sb ) ;
2013-07-01 16:12:37 +04:00
ext4_es_unregister_shrinker ( sbi ) ;
2013-12-09 05:52:31 +04:00
del_timer_sync ( & sbi - > s_err_report ) ;
2009-12-09 05:48:58 +03:00
ext4_release_system_zone ( sb ) ;
ext4_mb_release ( sb ) ;
ext4_ext_release ( sb ) ;
2017-02-05 07:38:06 +03:00
if ( ! ( sb - > s_flags & MS_RDONLY ) & & ! aborted ) {
2015-10-17 23:18:43 +03:00
ext4_clear_feature_journal_needs_recovery ( sb ) ;
2006-10-11 12:20:50 +04:00
es - > s_state = cpu_to_le16 ( sbi - > s_mount_state ) ;
}
2012-07-23 04:33:31 +04:00
if ( ! ( sb - > s_flags & MS_RDONLY ) )
2012-03-22 06:29:15 +04:00
ext4_commit_super ( sb , 1 ) ;
2006-10-11 12:20:50 +04:00
for ( i = 0 ; i < sbi - > s_gdb_count ; i + + )
brelse ( sbi - > s_group_desc [ i ] ) ;
2014-11-20 20:19:11 +03:00
kvfree ( sbi - > s_group_desc ) ;
kvfree ( sbi - > s_flex_groups ) ;
2011-09-10 02:56:51 +04:00
percpu_counter_destroy ( & sbi - > s_freeclusters_counter ) ;
2006-10-11 12:20:50 +04:00
percpu_counter_destroy ( & sbi - > s_freeinodes_counter ) ;
percpu_counter_destroy ( & sbi - > s_dirs_counter ) ;
2011-09-10 02:56:51 +04:00
percpu_counter_destroy ( & sbi - > s_dirtyclusters_counter ) ;
2016-04-26 06:22:35 +03:00
percpu_free_rwsem ( & sbi - > s_journal_flag_rwsem ) ;
2006-10-11 12:20:50 +04:00
# ifdef CONFIG_QUOTA
2014-09-11 19:15:15 +04:00
for ( i = 0 ; i < EXT4_MAXQUOTAS ; i + + )
2006-10-11 12:20:50 +04:00
kfree ( sbi - > s_qf_names [ i ] ) ;
# endif
/* Debugging code just in case the in-memory inode orphan list
* isn ' t empty . The on - disk one can be non - empty if we ' ve
* detected an error and taken the fs readonly , but the
* in - memory list had better be clean by this point . */
if ( ! list_empty ( & sbi - > s_orphan ) )
dump_orphan_list ( sb , sbi ) ;
J_ASSERT ( list_empty ( & sbi - > s_orphan ) ) ;
2015-06-21 05:50:33 +03:00
sync_blockdev ( sb - > s_bdev ) ;
2007-05-07 01:49:54 +04:00
invalidate_bdev ( sb - > s_bdev ) ;
2006-10-11 12:20:50 +04:00
if ( sbi - > journal_bdev & & sbi - > journal_bdev ! = sb - > s_bdev ) {
/*
* Invalidate the journal device ' s buffers . We don ' t want them
* floating about in memory - the physical journal device may be
* hotswapped , and it breaks the ` ro - after ' testing code .
*/
sync_blockdev ( sbi - > journal_bdev ) ;
2007-05-07 01:49:54 +04:00
invalidate_bdev ( sbi - > journal_bdev ) ;
2006-10-11 12:20:53 +04:00
ext4_blkdev_remove ( sbi ) ;
2006-10-11 12:20:50 +04:00
}
2017-06-22 18:44:55 +03:00
if ( sbi - > s_ea_inode_cache ) {
ext4_xattr_destroy_cache ( sbi - > s_ea_inode_cache ) ;
sbi - > s_ea_inode_cache = NULL ;
}
2017-06-22 18:28:55 +03:00
if ( sbi - > s_ea_block_cache ) {
ext4_xattr_destroy_cache ( sbi - > s_ea_block_cache ) ;
sbi - > s_ea_block_cache = NULL ;
2014-03-19 03:24:49 +04:00
}
2011-05-25 02:31:25 +04:00
if ( sbi - > s_mmp_tsk )
kthread_stop ( sbi - > s_mmp_tsk ) ;
2016-11-26 22:24:51 +03:00
brelse ( sbi - > s_sbh ) ;
2006-10-11 12:20:50 +04:00
sb - > s_fs_info = NULL ;
2009-03-31 17:10:09 +04:00
/*
* Now that we are completely done shutting down the
* superblock , we need to actually destroy the kobject .
*/
kobject_put ( & sbi - > s_kobj ) ;
wait_for_completion ( & sbi - > s_kobj_unregister ) ;
2012-04-30 02:27:10 +04:00
if ( sbi - > s_chksum_driver )
crypto_free_shash ( sbi - > s_chksum_driver ) ;
2009-02-16 02:07:52 +03:00
kfree ( sbi - > s_blockgroup_lock ) ;
2006-10-11 12:20:50 +04:00
kfree ( sbi ) ;
}
2006-12-07 07:33:20 +03:00
static struct kmem_cache * ext4_inode_cachep ;
2006-10-11 12:20:50 +04:00
/*
* Called inside transaction , so use GFP_NOFS
*/
2006-10-11 12:20:53 +04:00
static struct inode * ext4_alloc_inode ( struct super_block * sb )
2006-10-11 12:20:50 +04:00
{
2006-10-11 12:20:53 +04:00
struct ext4_inode_info * ei ;
2006-10-11 12:20:50 +04:00
2006-12-07 07:33:14 +03:00
ei = kmem_cache_alloc ( ext4_inode_cachep , GFP_NOFS ) ;
2006-10-11 12:20:50 +04:00
if ( ! ei )
return NULL ;
2009-06-04 01:59:28 +04:00
2006-10-11 12:20:50 +04:00
ei - > vfs_inode . i_version = 1 ;
2014-04-21 22:37:55 +04:00
spin_lock_init ( & ei - > i_raw_lock ) ;
2008-01-29 08:19:52 +03:00
INIT_LIST_HEAD ( & ei - > i_prealloc_list ) ;
spin_lock_init ( & ei - > i_prealloc_lock ) ;
2012-11-09 06:57:30 +04:00
ext4_es_init_tree ( & ei - > i_es_tree ) ;
rwlock_init ( & ei - > i_es_lock ) ;
2014-11-25 19:45:37 +03:00
INIT_LIST_HEAD ( & ei - > i_es_list ) ;
2014-09-02 06:26:49 +04:00
ei - > i_es_all_nr = 0 ;
2014-11-25 19:45:37 +03:00
ei - > i_es_shk_nr = 0 ;
2014-11-25 19:51:23 +03:00
ei - > i_es_shrink_lblk = 0 ;
2008-07-15 01:52:37 +04:00
ei - > i_reserved_data_blocks = 0 ;
ei - > i_reserved_meta_blocks = 0 ;
ei - > i_allocated_meta_blocks = 0 ;
2010-01-01 10:41:30 +03:00
ei - > i_da_metadata_calc_len = 0 ;
2012-08-06 07:28:16 +04:00
ei - > i_da_metadata_calc_last_lblock = 0 ;
2008-07-15 01:52:37 +04:00
spin_lock_init ( & ( ei - > i_block_reservation_lock ) ) ;
2009-12-14 15:21:14 +03:00
# ifdef CONFIG_QUOTA
ei - > i_reserved_quota = 0 ;
2014-09-29 16:58:25 +04:00
memset ( & ei - > i_dquot , 0 , sizeof ( ei - > i_dquot ) ) ;
2009-12-14 15:21:14 +03:00
# endif
2011-01-10 20:29:43 +03:00
ei - > jinode = NULL ;
2013-06-04 22:21:02 +04:00
INIT_LIST_HEAD ( & ei - > i_rsv_conversion_list ) ;
2010-03-05 00:14:02 +03:00
spin_lock_init ( & ei - > i_completed_io_lock ) ;
2009-12-09 07:51:10 +03:00
ei - > i_sync_tid = 0 ;
ei - > i_datasync_tid = 0 ;
2012-09-29 07:24:52 +04:00
atomic_set ( & ei - > i_unwritten , 0 ) ;
2013-06-04 22:21:02 +04:00
INIT_WORK ( & ei - > i_rsv_conversion_work , ext4_end_io_rsv_work ) ;
2006-10-11 12:20:50 +04:00
return & ei - > vfs_inode ;
}
2010-11-08 21:51:33 +03:00
static int ext4_drop_inode(struct inode *inode)
{
	int drop = generic_drop_inode(inode);

	trace_ext4_drop_inode(inode, drop);
	return drop;
}
2011-01-07 09:49:49 +03:00
static void ext4_i_callback(struct rcu_head *head)
{
	struct inode *inode = container_of(head, struct inode, i_rcu);
	kmem_cache_free(ext4_inode_cachep, EXT4_I(inode));
}
2006-10-11 12:20:53 +04:00
static void ext4_destroy_inode ( struct inode * inode )
2006-10-11 12:20:50 +04:00
{
2007-07-16 10:40:45 +04:00
if ( ! list_empty ( & ( EXT4_I ( inode ) - > i_orphan ) ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( inode - > i_sb , KERN_ERR ,
" Inode %lu (%p): orphan list check failed! " ,
inode - > i_ino , EXT4_I ( inode ) ) ;
2007-07-16 10:40:45 +04:00
print_hex_dump ( KERN_INFO , " " , DUMP_PREFIX_ADDRESS , 16 , 4 ,
EXT4_I ( inode ) , sizeof ( struct ext4_inode_info ) ,
true ) ;
dump_stack ( ) ;
}
2011-01-07 09:49:49 +03:00
call_rcu ( & inode - > i_rcu , ext4_i_callback ) ;
2006-10-11 12:20:50 +04:00
}
2008-07-26 06:45:34 +04:00
static void init_once ( void * foo )
2006-10-11 12:20:50 +04:00
{
2006-10-11 12:20:53 +04:00
struct ext4_inode_info * ei = ( struct ext4_inode_info * ) foo ;
2006-10-11 12:20:50 +04:00
2007-05-17 09:10:57 +04:00
INIT_LIST_HEAD ( & ei - > i_orphan ) ;
init_rwsem ( & ei - > xattr_sem ) ;
2008-01-29 07:58:26 +03:00
init_rwsem ( & ei - > i_data_sem ) ;
2015-12-07 22:28:03 +03:00
init_rwsem ( & ei - > i_mmap_sem ) ;
2007-05-17 09:10:57 +04:00
inode_init_once ( & ei - > vfs_inode ) ;
2006-10-11 12:20:50 +04:00
}
2014-02-18 05:34:53 +04:00
static int __init init_inodecache ( void )
2006-10-11 12:20:50 +04:00
{
2006-10-11 12:20:53 +04:00
ext4_inode_cachep = kmem_cache_create ( " ext4_inode_cache " ,
sizeof ( struct ext4_inode_info ) ,
2006-10-11 12:20:50 +04:00
0 , ( SLAB_RECLAIM_ACCOUNT |
2016-01-15 02:18:21 +03:00
SLAB_MEM_SPREAD | SLAB_ACCOUNT ) ,
2007-07-20 05:11:58 +04:00
init_once ) ;
2006-10-11 12:20:53 +04:00
if ( ext4_inode_cachep = = NULL )
2006-10-11 12:20:50 +04:00
return - ENOMEM ;
return 0 ;
}
static void destroy_inodecache ( void )
{
2012-09-26 05:33:07 +04:00
/*
* Make sure all delayed rcu free inodes are flushed before we
* destroy cache .
*/
rcu_barrier ( ) ;
2006-10-11 12:20:53 +04:00
kmem_cache_destroy ( ext4_inode_cachep ) ;
2006-10-11 12:20:50 +04:00
}
2010-06-07 21:16:22 +04:00
void ext4_clear_inode ( struct inode * inode )
2006-10-11 12:20:50 +04:00
{
2010-06-07 21:16:22 +04:00
invalidate_inode_buffers ( inode ) ;
2012-05-03 16:48:02 +04:00
clear_inode ( inode ) ;
2010-03-03 17:05:05 +03:00
dquot_drop ( inode ) ;
2008-10-10 17:40:52 +04:00
ext4_discard_preallocations ( inode ) ;
2012-11-09 06:57:32 +04:00
ext4_es_remove_extent ( inode , 0 , EXT_MAX_BLOCKS ) ;
2011-01-10 20:29:43 +03:00
if ( EXT4_I ( inode ) - > jinode ) {
jbd2_journal_release_jbd_inode ( EXT4_JOURNAL ( inode ) ,
EXT4_I ( inode ) - > jinode ) ;
jbd2_free_inode ( EXT4_I ( inode ) - > jinode ) ;
EXT4_I ( inode ) - > jinode = NULL ;
}
ext4 crypto: reorganize how we store keys in the inode
This is a pretty massive patch which does a number of different things:
1) The per-inode encryption information is now stored in an allocated
data structure, ext4_crypt_info, instead of directly in the node.
This reduces the size usage of an in-memory inode when it is not
using encryption.
2) We drop the ext4_fname_crypto_ctx entirely, and use the per-inode
encryption structure instead. This remove an unnecessary memory
allocation and free for the fname_crypto_ctx as well as allowing us
to reuse the ctfm in a directory for multiple lookups and file
creations.
3) We also cache the inode's policy information in the ext4_crypt_info
structure so we don't have to continually read it out of the
extended attributes.
4) We now keep the keyring key in the inode's encryption structure
instead of releasing it after we are done using it to derive the
per-inode key. This allows us to test to see if the key has been
revoked; if it has, we prevent the use of the derived key and free
it.
5) When an inode is released (or when the derived key is freed), we
will use memset_explicit() to zero out the derived key, so it's not
left hanging around in memory. This implies that when a user logs
out, it is important to first revoke the key, and then unlink it,
and then finally, to use "echo 3 > /proc/sys/vm/drop_caches" to
release any decrypted pages and dcache entries from the system
caches.
6) All this, and we also shrink the number of lines of code by around
100. :-)
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-18 20:17:47 +03:00
# ifdef CONFIG_EXT4_FS_ENCRYPTION
2016-07-10 21:01:03 +03:00
fscrypt_put_encryption_info ( inode , NULL ) ;
2015-05-18 20:17:47 +03:00
# endif
2006-10-11 12:20:50 +04:00
}
2007-10-22 03:42:08 +04:00
static struct inode * ext4_nfs_get_inode ( struct super_block * sb ,
2009-06-04 01:59:28 +04:00
u64 ino , u32 generation )
2006-10-11 12:20:50 +04:00
{
struct inode * inode ;
2006-10-11 12:20:53 +04:00
if ( ino < EXT4_FIRST_INO ( sb ) & & ino ! = EXT4_ROOT_INO )
2006-10-11 12:20:50 +04:00
return ERR_PTR ( - ESTALE ) ;
2006-10-11 12:20:53 +04:00
if ( ino > le32_to_cpu ( EXT4_SB ( sb ) - > s_es - > s_inodes_count ) )
2006-10-11 12:20:50 +04:00
return ERR_PTR ( - ESTALE ) ;
/* iget isn't really right if the inode is currently unallocated!!
*
2006-10-11 12:20:53 +04:00
* ext4_read_inode will return a bad_inode if the inode had been
2006-10-11 12:20:50 +04:00
* deleted , so we should be safe .
*
* Currently we don ' t know the generation for parent directory , so
* a generation of 0 means " accept any "
*/
ext4: add ext4_iget_normal() which is to be used for dir tree lookups
If there is a corrupted file system which has directory entries that
point at reserved, metadata inodes, prohibit them from being used by
treating them the same way we treat Boot Loader inodes --- that is,
mark them to be bad inodes. This prohibits them from being opened,
deleted, or modified via chmod, chown, utimes, etc.
In particular, this prevents a corrupted file system which has a
directory entry which points at the journal inode from being deleted
and its blocks released, after which point Much Hilarity Ensues.
Reported-by: Sami Liedes <sami.liedes@iki.fi>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-10-06 06:56:00 +04:00
inode = ext4_iget_normal ( sb , ino ) ;
2008-02-07 11:15:37 +03:00
if ( IS_ERR ( inode ) )
return ERR_CAST ( inode ) ;
if ( generation & & inode - > i_generation ! = generation ) {
2006-10-11 12:20:50 +04:00
iput ( inode ) ;
return ERR_PTR ( - ESTALE ) ;
}
2007-10-22 03:42:08 +04:00
return inode ;
}
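The checks in ext4_nfs_get_inode() reduce to three rules: reserved inode numbers other than the root are stale, numbers beyond the inode count are stale, and a non-zero generation must match the inode's generation (zero means "accept any"). The sketch below restates those rules as a pure function so they can be exercised in isolation. `check_handle`, its parameters and `DEMO_ROOT_INO` are hypothetical names introduced for this illustration; the real function also has to load the inode, which is omitted here.

```c
#include <assert.h>
#include <errno.h>

/* Assumed value for this sketch; stands in for EXT4_ROOT_INO. */
#define DEMO_ROOT_INO 2ULL

/*
 * Hypothetical validity check for an NFS file handle.
 * Returns 0 if the handle may be used, -ESTALE otherwise.
 */
static int check_handle(unsigned long long ino, unsigned int generation,
			unsigned long long first_ino,
			unsigned long long inodes_count,
			unsigned int inode_generation)
{
	/* Reserved inodes are refused, except for the root directory. */
	if (ino < first_ino && ino != DEMO_ROOT_INO)
		return -ESTALE;
	/* Inode numbers past the end of the inode tables cannot exist. */
	if (ino > inodes_count)
		return -ESTALE;
	/* A generation of 0 means "accept any". */
	if (generation && inode_generation != generation)
		return -ESTALE;
	return 0;
}
```

With a first usable inode of 11 and 1000 inodes, for example, inode 2 is accepted, inode 5 is refused, and inode 12 is accepted only when the requested generation is 0 or equals the stored one.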
static struct dentry * ext4_fh_to_dentry ( struct super_block * sb , struct fid * fid ,
2009-06-04 01:59:28 +04:00
int fh_len , int fh_type )
2007-10-22 03:42:08 +04:00
{
return generic_fh_to_dentry ( sb , fid , fh_len , fh_type ,
ext4_nfs_get_inode ) ;
}
static struct dentry * ext4_fh_to_parent ( struct super_block * sb , struct fid * fid ,
2009-06-04 01:59:28 +04:00
int fh_len , int fh_type )
2007-10-22 03:42:08 +04:00
{
return generic_fh_to_parent ( sb , fid , fh_len , fh_type ,
ext4_nfs_get_inode ) ;
2006-10-11 12:20:50 +04:00
}
2009-01-06 06:38:48 +03:00
/*
* Try to release metadata pages ( indirect blocks , directories ) which are
* mapped via the block device . Since these pages could have journal heads
* which would prevent try_to_free_buffers ( ) from freeing them , we must use
* jbd2 layer ' s try_to_free_buffers ( ) function to release them .
*/
2009-06-04 01:59:28 +04:00
static int bdev_try_to_free_page ( struct super_block * sb , struct page * page ,
gfp_t wait )
2009-01-06 06:38:48 +03:00
{
journal_t * journal = EXT4_SB ( sb ) - > s_journal ;
WARN_ON ( PageChecked ( page ) ) ;
if ( ! page_has_buffers ( page ) )
return 0 ;
if ( journal )
return jbd2_journal_try_to_free_buffers ( journal , page ,
2015-11-07 03:28:21 +03:00
wait & ~ __GFP_DIRECT_RECLAIM ) ;
2009-01-06 06:38:48 +03:00
return try_to_free_buffers ( page ) ;
}
2016-07-10 21:01:03 +03:00
# ifdef CONFIG_EXT4_FS_ENCRYPTION
static int ext4_get_context(struct inode *inode, void *ctx, size_t len)
{
	return ext4_xattr_get(inode, EXT4_XATTR_INDEX_ENCRYPTION,
			      EXT4_XATTR_NAME_ENCRYPTION_CONTEXT, ctx, len);
}
static int ext4_set_context ( struct inode * inode , const void * ctx , size_t len ,
void * fs_data )
{
ext4: avoid lockdep warning when inheriting encryption context
On a lockdep-enabled kernel, xfstests generic/027 fails due to a lockdep
warning when run on ext4 mounted with -o test_dummy_encryption:
xfs_io/4594 is trying to acquire lock:
(jbd2_handle
){++++.+}, at:
[<ffffffff813096ef>] jbd2_log_wait_commit+0x5/0x11b
but task is already holding lock:
(jbd2_handle
){++++.+}, at:
[<ffffffff813000de>] start_this_handle+0x354/0x3d8
The abbreviated call stack is:
[<ffffffff813096ef>] ? jbd2_log_wait_commit+0x5/0x11b
[<ffffffff8130972a>] jbd2_log_wait_commit+0x40/0x11b
[<ffffffff813096ef>] ? jbd2_log_wait_commit+0x5/0x11b
[<ffffffff8130987b>] ? __jbd2_journal_force_commit+0x76/0xa6
[<ffffffff81309896>] __jbd2_journal_force_commit+0x91/0xa6
[<ffffffff813098b9>] jbd2_journal_force_commit_nested+0xe/0x18
[<ffffffff812a6049>] ext4_should_retry_alloc+0x72/0x79
[<ffffffff812f0c1f>] ext4_xattr_set+0xef/0x11f
[<ffffffff812cc35b>] ext4_set_context+0x3a/0x16b
[<ffffffff81258123>] fscrypt_inherit_context+0xe3/0x103
[<ffffffff812ab611>] __ext4_new_inode+0x12dc/0x153a
[<ffffffff812bd371>] ext4_create+0xb7/0x161
When a file is created in an encrypted directory, ext4_set_context() is
called to set an encryption context on the new file. This calls
ext4_xattr_set(), which contains a retry loop where the journal is
forced to commit if an ENOSPC error is encountered.
If the task actually were to wait for the journal to commit in this
case, then it would deadlock because a handle remains open from
__ext4_new_inode(), so the running transaction can't be committed yet.
Fortunately, __jbd2_journal_force_commit() avoids the deadlock by not
allowing the running transaction to be committed while the current task
has it open. However, the above lockdep warning is still triggered.
This was a false positive which was introduced by: 1eaa566d368b: jbd2:
track more dependencies on transaction commit
Fix the problem by passing the handle through the 'fs_data' argument to
ext4_set_context(), then using ext4_xattr_set_handle() instead of
ext4_xattr_set(). And in the case where no journal handle is specified
and ext4_set_context() has to open one, add an ENOSPC retry loop since
in that case it is the outermost transaction.
Signed-off-by: Eric Biggers <ebiggers@google.com>
2016-11-21 19:52:44 +03:00
handle_t * handle = fs_data ;
2017-06-22 05:28:40 +03:00
int res , res2 , credits , retries = 0 ;
ext4: forbid encrypting root directory
Currently it's possible to encrypt all files and directories on an ext4
filesystem by deleting everything, including lost+found, then setting an
encryption policy on the root directory. However, this is incompatible
with e2fsck because e2fsck expects to find, create, and/or write to
lost+found and does not have access to any encryption keys. Especially
problematic is that if e2fsck can't find lost+found, it will create it
without regard for whether the root directory is encrypted. This is
wrong for obvious reasons, and it causes a later run of e2fsck to
consider the lost+found directory entry to be corrupted.
Encrypting the root directory may also be of limited use because it is
the "all-or-nothing" use case, for which dm-crypt can be used instead.
(By design, encryption policies are inherited and cannot be overridden;
so the root directory having an encryption policy implies that all files
and directories on the filesystem have that same encryption policy.)
In any case, encrypting the root directory is broken currently and must
not be allowed; so start returning an error if userspace requests it.
For now only do this in ext4, because f2fs and ubifs do not appear to
have the lost+found requirement. We could move it into
fscrypt_ioctl_set_policy() later if desired, though.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2017-06-23 07:10:36 +03:00
	/*
	 * Encrypting the root directory is not allowed because e2fsck expects
	 * lost+found to exist and be unencrypted, and encrypting the root
	 * directory would imply encrypting the lost+found directory as well as
	 * the filename "lost+found" itself.
	 */
	if (inode->i_ino == EXT4_ROOT_INO)
		return -EPERM;
ext4: avoid lockdep warning when inheriting encryption context
On a lockdep-enabled kernel, xfstests generic/027 fails due to a lockdep
warning when run on ext4 mounted with -o test_dummy_encryption:
xfs_io/4594 is trying to acquire lock:
(jbd2_handle){++++.+}, at:
[<ffffffff813096ef>] jbd2_log_wait_commit+0x5/0x11b
but task is already holding lock:
(jbd2_handle){++++.+}, at:
[<ffffffff813000de>] start_this_handle+0x354/0x3d8
The abbreviated call stack is:
[<ffffffff813096ef>] ? jbd2_log_wait_commit+0x5/0x11b
[<ffffffff8130972a>] jbd2_log_wait_commit+0x40/0x11b
[<ffffffff813096ef>] ? jbd2_log_wait_commit+0x5/0x11b
[<ffffffff8130987b>] ? __jbd2_journal_force_commit+0x76/0xa6
[<ffffffff81309896>] __jbd2_journal_force_commit+0x91/0xa6
[<ffffffff813098b9>] jbd2_journal_force_commit_nested+0xe/0x18
[<ffffffff812a6049>] ext4_should_retry_alloc+0x72/0x79
[<ffffffff812f0c1f>] ext4_xattr_set+0xef/0x11f
[<ffffffff812cc35b>] ext4_set_context+0x3a/0x16b
[<ffffffff81258123>] fscrypt_inherit_context+0xe3/0x103
[<ffffffff812ab611>] __ext4_new_inode+0x12dc/0x153a
[<ffffffff812bd371>] ext4_create+0xb7/0x161
When a file is created in an encrypted directory, ext4_set_context() is
called to set an encryption context on the new file. This calls
ext4_xattr_set(), which contains a retry loop where the journal is
forced to commit if an ENOSPC error is encountered.
If the task actually were to wait for the journal to commit in this
case, then it would deadlock because a handle remains open from
__ext4_new_inode(), so the running transaction can't be committed yet.
Fortunately, __jbd2_journal_force_commit() avoids the deadlock by not
allowing the running transaction to be committed while the current task
has it open. However, the above lockdep warning is still triggered.
This was a false positive which was introduced by: 1eaa566d368b: jbd2:
track more dependencies on transaction commit
Fix the problem by passing the handle through the 'fs_data' argument to
ext4_set_context(), then using ext4_xattr_set_handle() instead of
ext4_xattr_set(). And in the case where no journal handle is specified
and ext4_set_context() has to open one, add an ENOSPC retry loop since
in that case it is the outermost transaction.
Signed-off-by: Eric Biggers <ebiggers@google.com>
2016-11-21 19:52:44 +03:00
2017-02-23 00:25:14 +03:00
	res = ext4_convert_inline_data(inode);
	if (res)
		return res;
2016-11-21 19:52:44 +03:00
	/*
	 * If a journal handle was specified, then the encryption context is
	 * being set on a new inode via inheritance and is part of a larger
	 * transaction to create the inode. Otherwise the encryption context is
	 * being set on an existing inode in its own transaction. Only in the
	 * latter case should the "retry on ENOSPC" logic be used.
	 */
2016-07-10 21:01:03 +03:00
2016-11-21 19:52:44 +03:00
	if (handle) {
		res = ext4_xattr_set_handle(handle, inode,
					    EXT4_XATTR_INDEX_ENCRYPTION,
					    EXT4_XATTR_NAME_ENCRYPTION_CONTEXT,
					    ctx, len, 0);
2016-07-10 21:01:03 +03:00
		if (!res) {
			ext4_set_inode_flag(inode, EXT4_INODE_ENCRYPT);
			ext4_clear_inode_state(inode,
					EXT4_STATE_MAY_INLINE_DATA);
2016-11-21 01:32:59 +03:00
			/*
			 * Update inode->i_flags - e.g. S_DAX may get disabled
			 */
			ext4_set_inode_flags(inode);
2016-07-10 21:01:03 +03:00
		}
		return res;
	}
2017-05-25 01:24:07 +03:00
	res = dquot_initialize(inode);
	if (res)
		return res;
2016-11-21 19:52:44 +03:00
retry:
2017-07-06 07:01:59 +03:00
	res = ext4_xattr_set_credits(inode, len, false /* is_create */,
				     &credits);
2017-06-22 18:44:55 +03:00
	if (res)
		return res;
2017-06-22 05:28:40 +03:00
	handle = ext4_journal_start(inode, EXT4_HT_MISC, credits);
2016-07-10 21:01:03 +03:00
	if (IS_ERR(handle))
		return PTR_ERR(handle);
2016-11-21 19:52:44 +03:00
	res = ext4_xattr_set_handle(handle, inode, EXT4_XATTR_INDEX_ENCRYPTION,
				    EXT4_XATTR_NAME_ENCRYPTION_CONTEXT,
				    ctx, len, 0);
2016-07-10 21:01:03 +03:00
	if (!res) {
		ext4_set_inode_flag(inode, EXT4_INODE_ENCRYPT);
2016-11-21 01:32:59 +03:00
		/* Update inode->i_flags - e.g. S_DAX may get disabled */
		ext4_set_inode_flags(inode);
2016-07-10 21:01:03 +03:00
		res = ext4_mark_inode_dirty(handle, inode);
		if (res)
			EXT4_ERROR_INODE(inode, "Failed to mark inode dirty");
	}
	res2 = ext4_journal_stop(handle);
2016-11-21 19:52:44 +03:00
	if (res == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
		goto retry;
2016-07-10 21:01:03 +03:00
	if (!res)
		res = res2;
	return res;
}
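The two paths through ext4_set_context() above (use the caller's handle and never retry, or open a transaction and retry on ENOSPC) can be modelled in userspace. The following is only a sketch under stated assumptions: struct fake_handle, fake_xattr_set(), fake_should_retry() and set_context_sketch() are hypothetical stand-ins, not kernel APIs, and the fixed retry bound is an illustration rather than the policy of ext4_should_retry_alloc().

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a journal handle; not the kernel's handle_t. */
struct fake_handle { int id; };

/* Hypothetical knobs so a caller can observe and steer the simulation. */
static int attempts;      /* number of simulated xattr writes so far */
static int fail_first_n;  /* the first N writes report -ENOSPC */

/* Simulated xattr write: fails with -ENOSPC for the first fail_first_n calls. */
static int fake_xattr_set(struct fake_handle *h)
{
	(void)h;
	attempts++;
	return attempts <= fail_first_n ? -ENOSPC : 0;
}

/* Simulated retry policy (assumption): allow at most three retries. */
static bool fake_should_retry(int *retries)
{
	return (*retries)++ < 3;
}

/*
 * Model of the control flow only:
 *  - outer != NULL: we are nested inside the caller's transaction, so the
 *    write is attempted once and any error is returned as-is.
 *  - outer == NULL: we own the outermost transaction, so an ENOSPC result
 *    may be retried.
 */
static int set_context_sketch(struct fake_handle *outer)
{
	struct fake_handle own = { 1 };
	int res, retries = 0;

	if (outer)
		return fake_xattr_set(outer);
retry:
	res = fake_xattr_set(&own);
	if (res == -ENOSPC && fake_should_retry(&retries))
		goto retry;
	return res;
}
```

Calling set_context_sketch(NULL) with fail_first_n set to 2 retries until the third attempt succeeds, while passing a non-NULL handle returns the first -ENOSPC unchanged. That mirrors why the retry loop is kept out of the inherited-context path.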
2017-06-22 22:14:40 +03:00
static bool ext4_dummy_context(struct inode *inode)
2016-07-10 21:01:03 +03:00
{
	return DUMMY_ENCRYPTION_ENABLED(EXT4_SB(inode->i_sb));
}
static unsigned ext4_max_namelen(struct inode *inode)
{
	return S_ISLNK(inode->i_mode) ? inode->i_sb->s_blocksize :
		EXT4_NAME_LEN;
}
2017-02-07 23:42:10 +03:00
static const struct fscrypt_operations ext4_cryptops = {
2017-01-06 00:51:18 +03:00
	.key_prefix = "ext4:",
2016-07-10 21:01:03 +03:00
	.get_context = ext4_get_context,
	.set_context = ext4_set_context,
	.dummy_context = ext4_dummy_context,
	.is_encrypted = ext4_encrypted_inode,
	.empty_dir = ext4_empty_dir,
	.max_namelen = ext4_max_namelen,
};
# else
2017-02-07 23:42:10 +03:00
static const struct fscrypt_operations ext4_cryptops = {
2016-07-10 21:01:03 +03:00
	.is_encrypted = ext4_encrypted_inode,
};
# endif
2006-10-11 12:20:50 +04:00
# ifdef CONFIG_QUOTA
2017-04-30 06:47:50 +03:00
static const char * const quotatypes[] = INITQFNAMES;
2016-01-09 00:01:22 +03:00
# define QTYPE2NAME(t) (quotatypes[t])
2006-10-11 12:20:50 +04:00
2006-10-11 12:20:53 +04:00
static int ext4_write_dquot(struct dquot *dquot);
static int ext4_acquire_dquot(struct dquot *dquot);
static int ext4_release_dquot(struct dquot *dquot);
static int ext4_mark_dquot_dirty(struct dquot *dquot);
static int ext4_write_info(struct super_block *sb, int type);
2008-04-28 13:14:34 +04:00
static int ext4_quota_on(struct super_block *sb, int type, int format_id,
2016-11-21 03:49:34 +03:00
			 const struct path *path);
2006-10-11 12:20:53 +04:00
static int ext4_quota_on_mount(struct super_block *sb, int type);
static ssize_t ext4_quota_read(struct super_block *sb, int type, char *data,
2006-10-11 12:20:50 +04:00
			       size_t len, loff_t off);
2006-10-11 12:20:53 +04:00
static ssize_t ext4_quota_write(struct super_block *sb, int type,
2006-10-11 12:20:50 +04:00
				const char *data, size_t len, loff_t off);
ext4: make quota as first class supported feature
This patch adds support for quotas as a first class feature in ext4;
which is to say, the quota files are stored in hidden inodes as file
system metadata, instead of as separate files visible in the file system
directory hierarchy.
It is based on the proposal at:
https://ext4.wiki.kernel.org/index.php/Design_For_1st_Class_Quota_in_Ext4
This patch introduces a new feature - EXT4_FEATURE_RO_COMPAT_QUOTA
which, when turned on, enables quota accounting at mount time
itself. Also, the quota inodes are stored in two additional superblock
fields. Some changes introduced by this patch that should be pointed
out are:
1) Two new ext4-superblock fields - s_usr_quota_inum and
s_grp_quota_inum for storing the quota inodes in use.
2) Default quota inodes are: inode#3 for tracking userquota and inode#4
for tracking group quota. The superblock fields can be set to use
other inodes as well.
3) If the QUOTA feature and corresponding quota inodes are set in
superblock, the quota usage tracking is turned on at mount time. On
'quotaon' ioctl, the quota limits enforcement is turned
on. 'quotaoff' ioctl turns off only the limits enforcement in this
case.
4) When QUOTA feature is in use, the quota mount options 'quota',
'usrquota', 'grpquota' are ignored by the kernel.
5) mke2fs or tune2fs can be used to set the QUOTA feature and initialize
quota inodes. The default reserved inodes will not be visible to user
as regular files.
6) The quota-tools will need to be modified to support hidden quota
files on ext4. E2fsprogs will also include support for creating and
fixing quota files.
7) Support is only for the new V2 quota file format.
Tested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johann Lombardi <johann@whamcloud.com>
Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-07-23 04:21:31 +04:00
static int ext4_quota_enable(struct super_block *sb, int type, int format_id,
			     unsigned int flags);
static int ext4_enable_quotas(struct super_block *sb);
2016-04-01 19:00:03 +03:00
static int ext4_get_next_id(struct super_block *sb, struct kqid *qid);
2006-10-11 12:20:50 +04:00
2014-09-29 16:58:25 +04:00
static struct dquot **ext4_get_dquots(struct inode *inode)
{
	return EXT4_I(inode)->i_dquot;
}
2009-09-22 04:01:08 +04:00
static const struct dquot_operations ext4_quota_operations = {
2017-06-22 18:46:48 +03:00
	.get_reserved_space = ext4_get_reserved_space,
	.write_dquot = ext4_write_dquot,
	.acquire_dquot = ext4_acquire_dquot,
	.release_dquot = ext4_release_dquot,
	.mark_dirty = ext4_mark_dquot_dirty,
	.write_info = ext4_write_info,
	.alloc_dquot = dquot_alloc,
	.destroy_dquot = dquot_destroy,
	.get_projid = ext4_get_projid,
	.get_inode_usage = ext4_get_inode_usage,
	.get_next_id = ext4_get_next_id,
2006-10-11 12:20:50 +04:00
};
2009-09-22 04:01:09 +04:00
static const struct quotactl_ops ext4_qctl_operations = {
2006-10-11 12:20:53 +04:00
	.quota_on = ext4_quota_on,
2010-08-02 01:48:36 +04:00
	.quota_off = ext4_quota_off,
2010-05-19 15:16:45 +04:00
	.quota_sync = dquot_quota_sync,
2014-11-19 02:42:09 +03:00
	.get_state = dquot_get_state,
2010-05-19 15:16:45 +04:00
	.set_info = dquot_set_dqinfo,
	.get_dqblk = dquot_get_dqblk,
2016-02-19 21:19:01 +03:00
	.set_dqblk = dquot_set_dqblk,
	.get_nextdqblk = dquot_get_next_dqblk,
2006-10-11 12:20:50 +04:00
};
# endif
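The behaviour described in the "make quota as first class supported feature" commit message above (usage tracking switched on at mount when the feature and quota inodes are set, with quotaon/quotaoff toggling only limit enforcement) can be sketched as a small userspace state model. This is an illustration only: every sketch_* name is hypothetical, the structure is a simplified stand-in for the two superblock fields named in the commit message, and only the hidden-quota-inode case is modelled.

```c
#include <stdbool.h>
#include <stdint.h>

enum sketch_qtype { SK_USRQUOTA = 0, SK_GRPQUOTA = 1, SK_MAXQUOTAS };

/* Hypothetical, simplified view of the superblock state in question. */
struct sketch_sb {
	bool     has_quota_feature;   /* models the RO_COMPAT QUOTA feature flag */
	uint32_t s_usr_quota_inum;    /* default per the commit message: inode 3 */
	uint32_t s_grp_quota_inum;    /* default per the commit message: inode 4 */
};

/* Per-quota-type runtime state. */
struct sketch_qstate {
	bool usage_tracking;
	bool limits_enforced;
};

/* Pick the quota inode number recorded for a given quota type. */
static uint32_t sketch_quota_inum(const struct sketch_sb *sb,
				  enum sketch_qtype t)
{
	return t == SK_USRQUOTA ? sb->s_usr_quota_inum : sb->s_grp_quota_inum;
}

/* Mount: tracking starts only if the feature is set and an inode is recorded. */
static void sketch_mount(const struct sketch_sb *sb,
			 struct sketch_qstate st[SK_MAXQUOTAS])
{
	for (int t = 0; t < SK_MAXQUOTAS; t++) {
		st[t].usage_tracking = sb->has_quota_feature &&
			sketch_quota_inum(sb, (enum sketch_qtype)t) != 0;
		st[t].limits_enforced = false;
	}
}

/* quotaon: turns on limit enforcement. */
static void sketch_quotaon(struct sketch_qstate *st)
{
	st->limits_enforced = true;
}

/* quotaoff: turns off only limit enforcement; tracking is left untouched. */
static void sketch_quotaoff(struct sketch_qstate *st)
{
	st->limits_enforced = false;
}
```

The point of the model is the asymmetry: sketch_quotaoff() never clears usage_tracking, which corresponds to item 3 of the commit message.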
2007-02-12 11:55:41 +03:00
static const struct super_operations ext4_sops = {
2006-10-11 12:20:53 +04:00
	.alloc_inode = ext4_alloc_inode,
	.destroy_inode = ext4_destroy_inode,
	.write_inode = ext4_write_inode,
	.dirty_inode = ext4_dirty_inode,
2010-11-08 21:51:33 +03:00
	.drop_inode = ext4_drop_inode,
2010-06-07 21:16:22 +04:00
	.evict_inode = ext4_evict_inode,
2006-10-11 12:20:53 +04:00
	.put_super = ext4_put_super,
	.sync_fs = ext4_sync_fs,
2009-01-10 03:40:58 +03:00
	.freeze_fs = ext4_freeze,
	.unfreeze_fs = ext4_unfreeze,
2006-10-11 12:20:53 +04:00
	.statfs = ext4_statfs,
	.remount_fs = ext4_remount,
	.show_options = ext4_show_options,
2006-10-11 12:20:50 +04:00
# ifdef CONFIG_QUOTA
2006-10-11 12:20:53 +04:00
	.quota_read = ext4_quota_read,
	.quota_write = ext4_quota_write,
2014-09-29 16:58:25 +04:00
	.get_dquots = ext4_get_dquots,
2006-10-11 12:20:50 +04:00
# endif
2009-01-06 06:38:48 +03:00
	.bdev_try_to_free_page = bdev_try_to_free_page,
2006-10-11 12:20:50 +04:00
};
2007-10-22 03:42:17 +04:00
static const struct export_operations ext4_export_ops = {
2007-10-22 03:42:08 +04:00
	.fh_to_dentry = ext4_fh_to_dentry,
	.fh_to_parent = ext4_fh_to_parent,
2006-10-11 12:20:53 +04:00
	.get_parent = ext4_get_parent,
2006-10-11 12:20:50 +04:00
};
enum {
	Opt_bsd_df, Opt_minix_df, Opt_grpid, Opt_nogrpid,
	Opt_resgid, Opt_resuid, Opt_sb, Opt_err_cont, Opt_err_panic, Opt_err_ro,
2012-03-04 03:04:40 +04:00
	Opt_nouid32, Opt_debug, Opt_removed,
2006-10-11 12:20:50 +04:00
	Opt_user_xattr, Opt_nouser_xattr, Opt_acl, Opt_noacl,
2012-03-04 03:04:40 +04:00
	Opt_auto_da_alloc, Opt_noauto_da_alloc, Opt_noload,
ext4: allow specifying external journal by pathname mount option
It's always been a hassle that if an external journal's
device number changes, the filesystem won't mount.
And since boot-time enumeration can change, device number
changes aren't unusual.
The current mechanism to update the journal location is by
passing in a mount option w/ a new devnum, but that's a hassle;
it's a manual approach, fixing things after the fact.
Adding a mount option, "-o journal_path=/dev/$DEVICE" would
help, since then we can do e.g.
# mount -o journal_path=/dev/disk/by-label/$JOURNAL_LABEL ...
and it'll mount even if the devnum has changed, as shown here:
# losetup /dev/loop0 journalfile
# mke2fs -L mylabel-journal -O journal_dev /dev/loop0
# mkfs.ext4 -L mylabel -J device=/dev/loop0 /dev/sdb1
Change the journal device number:
# losetup -d /dev/loop0
# losetup /dev/loop1 journalfile
And today it will fail:
# mount /dev/sdb1 /mnt/test
mount: wrong fs type, bad option, bad superblock on /dev/sdb1,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so
# dmesg | tail -n 1
[17343.240702] EXT4-fs (sdb1): error: couldn't read superblock of external journal
But with this new mount option, we can specify the new path:
# mount -o journal_path=/dev/loop1 /dev/sdb1 /mnt/test
#
(which does update the encoded device number, incidentally):
# umount /dev/sdb1
# dumpe2fs -h /dev/sdb1 | grep "Journal device"
dumpe2fs 1.41.12 (17-May-2010)
Journal device: 0x0701
But best of all we can just always mount by journal-path, and
it'll always work:
# mount -o journal_path=/dev/disk/by-label/mylabel-journal /dev/sdb1 /mnt/test
#
So the journal_path option can be specified in fstab, and as long as
the disk is available somewhere, and findable by label (or by UUID),
we can mount.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2013-08-29 03:05:07 +04:00
	Opt_commit, Opt_min_batch_time, Opt_max_batch_time, Opt_journal_dev,
	Opt_journal_path, Opt_journal_checksum, Opt_journal_async_commit,
2006-10-11 12:20:50 +04:00
	Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
2015-04-16 08:56:00 +03:00
	Opt_data_err_abort, Opt_data_err_ignore, Opt_test_dummy_encryption,
2006-10-11 12:20:50 +04:00
	Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
2009-12-01 01:58:32 +03:00
	Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_jqfmt_vfsv1, Opt_quota,
2012-03-02 21:14:24 +04:00
	Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
2016-09-06 06:08:16 +03:00
	Opt_usrquota, Opt_grpquota, Opt_prjquota, Opt_i_version, Opt_dax,
ext4: Turn off multiple page-io submission by default
Jon Nelson has found a test case which causes postgresql to fail with
the error:
psql:t.sql:4: ERROR: invalid page header in block 38269 of relation base/16384/16581
Under memory pressure, it looks like part of a file can end up getting
replaced by zeros. Until we can figure out the cause, we'll roll
back the change and use block_write_full_page() instead of
ext4_bio_write_page(). The new, more efficient writing function can
be used via the mount option mblk_io_submit, so we can test and fix
the new page I/O code.
To reproduce the problem, install postgres 8.4 or 9.0, and pin enough
memory that the system is just on the verge of triggering writeback
before running the following sql script:
begin;
create temporary table foo as select x as a, ARRAY[x] as b FROM
generate_series(1, 10000000 ) AS x;
create index foo_a_idx on foo (a);
create index foo_b_idx on foo USING GIN (b);
rollback;
If the temporary table is created on a hard drive partition which is
encrypted using dm_crypt, then under memory pressure, approximately
30-40% of the time, pgsql will issue the above failure.
This patch should fix this problem, and the problem will come back if
the file system is mounted with the mblk_io_submit mount option.
Reported-by: Jon Nelson <jnelson@jamponi.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-14 23:27:50 +03:00
	Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit,
2017-01-11 23:32:22 +03:00
	Opt_lazytime, Opt_nolazytime, Opt_debug_want_extra_isize,
2010-12-14 23:27:50 +03:00
	Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
2009-11-19 22:25:42 +03:00
	Opt_inode_readahead_blks, Opt_journal_ioprio,
2010-03-05 00:14:02 +03:00
	Opt_dioread_nolock, Opt_dioread_lock,
2011-12-13 07:06:18 +04:00
	Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
2017-06-22 18:55:14 +03:00
	Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
2006-10-11 12:20:50 +04:00
};
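Each Opt_* value in the enum above is paired in this file with one or more option strings, some of which carry an argument (for example "journal_path=%s" from the journal_path commit message above). A minimal userspace sketch of that pairing follows. It is a simplified model, not the kernel's match_token(): the sk_* names are hypothetical, and argument-taking options are matched by prefix rather than by format specifier.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical token values, loosely mirroring a few Opt_* entries. */
enum sk_opt {
	SK_OPT_ERR = 0,
	SK_OPT_NOLOAD,
	SK_OPT_JOURNAL_DEV,
	SK_OPT_JOURNAL_PATH,
};

/* One table row: a token, the text to match, and whether an argument follows. */
struct sk_pattern {
	enum sk_opt token;
	const char *text;
	int takes_arg;
};

/* Two spellings may map to one token, as "norecovery" and "noload" do. */
static const struct sk_pattern sk_tokens[] = {
	{ SK_OPT_NOLOAD,       "norecovery",    0 },
	{ SK_OPT_NOLOAD,       "noload",        0 },
	{ SK_OPT_JOURNAL_DEV,  "journal_dev=",  1 },
	{ SK_OPT_JOURNAL_PATH, "journal_path=", 1 },
};

/*
 * Return the token for one option string. For options that take an
 * argument, *arg points at the text after '='; otherwise *arg is NULL.
 */
static enum sk_opt sk_match(const char *opt, const char **arg)
{
	size_t i;

	*arg = NULL;
	for (i = 0; i < sizeof(sk_tokens) / sizeof(sk_tokens[0]); i++) {
		const struct sk_pattern *p = &sk_tokens[i];
		size_t n = strlen(p->text);

		if (p->takes_arg) {
			if (strncmp(opt, p->text, n) == 0) {
				*arg = opt + n;
				return p->token;
			}
		} else if (strcmp(opt, p->text) == 0) {
			return p->token;
		}
	}
	return SK_OPT_ERR;
}
```

With this table, sk_match("journal_path=/dev/loop1", &arg) yields SK_OPT_JOURNAL_PATH and leaves arg pointing at "/dev/loop1", while an unknown string falls through to SK_OPT_ERR.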
2008-10-13 13:46:57 +04:00
static const match_table_t tokens = {
2006-10-11 12:20:50 +04:00
	{Opt_bsd_df, "bsddf"},
	{Opt_minix_df, "minixdf"},
	{Opt_grpid, "grpid"},
	{Opt_grpid, "bsdgroups"},
	{Opt_nogrpid, "nogrpid"},
	{Opt_nogrpid, "sysvgroups"},
	{Opt_resgid, "resgid=%u"},
	{Opt_resuid, "resuid=%u"},
	{Opt_sb, "sb=%u"},
	{Opt_err_cont, "errors=continue"},
	{Opt_err_panic, "errors=panic"},
	{Opt_err_ro, "errors=remount-ro"},
	{Opt_nouid32, "nouid32"},
	{Opt_debug, "debug"},
2012-03-04 03:04:40 +04:00
	{Opt_removed, "oldalloc"},
	{Opt_removed, "orlov"},
2006-10-11 12:20:50 +04:00
	{Opt_user_xattr, "user_xattr"},
	{Opt_nouser_xattr, "nouser_xattr"},
	{Opt_acl, "acl"},
	{Opt_noacl, "noacl"},
2009-11-19 22:28:50 +03:00
	{Opt_noload, "norecovery"},
2012-03-05 04:27:31 +04:00
	{Opt_noload, "noload"},
2012-03-04 03:04:40 +04:00
	{Opt_removed, "nobh"},
	{Opt_removed, "bh"},
2006-10-11 12:20:50 +04:00
	{Opt_commit, "commit=%u"},
2009-01-04 04:27:38 +03:00
	{Opt_min_batch_time, "min_batch_time=%u"},
	{Opt_max_batch_time, "max_batch_time=%u"},
2006-10-11 12:20:50 +04:00
	{Opt_journal_dev, "journal_dev=%u"},
2013-08-29 03:05:07 +04:00
{ Opt_journal_path , " journal_path=%s " } ,
2008-01-29 07:58:27 +03:00
{ Opt_journal_checksum , " journal_checksum " } ,
2014-11-26 00:20:50 +03:00
{ Opt_nojournal_checksum , " nojournal_checksum " } ,
2008-01-29 07:58:27 +03:00
{ Opt_journal_async_commit , " journal_async_commit " } ,
2006-10-11 12:20:50 +04:00
{ Opt_abort , " abort " } ,
{ Opt_data_journal , " data=journal " } ,
{ Opt_data_ordered , " data=ordered " } ,
{ Opt_data_writeback , " data=writeback " } ,
2008-10-11 06:12:43 +04:00
{ Opt_data_err_abort , " data_err=abort " } ,
{ Opt_data_err_ignore , " data_err=ignore " } ,
2006-10-11 12:20:50 +04:00
{ Opt_offusrjquota , " usrjquota= " } ,
{ Opt_usrjquota , " usrjquota=%s " } ,
{ Opt_offgrpjquota , " grpjquota= " } ,
{ Opt_grpjquota , " grpjquota=%s " } ,
{ Opt_jqfmt_vfsold , " jqfmt=vfsold " } ,
{ Opt_jqfmt_vfsv0 , " jqfmt=vfsv0 " } ,
2009-12-01 01:58:32 +03:00
{ Opt_jqfmt_vfsv1 , " jqfmt=vfsv1 " } ,
2006-10-11 12:20:50 +04:00
{ Opt_grpquota , " grpquota " } ,
{ Opt_noquota , " noquota " } ,
{ Opt_quota , " quota " } ,
{ Opt_usrquota , " usrquota " } ,
2016-09-06 06:08:16 +03:00
{ Opt_prjquota , " prjquota " } ,
2006-10-11 12:20:50 +04:00
{ Opt_barrier , " barrier=%u " } ,
2009-03-28 17:59:57 +03:00
{ Opt_barrier , " barrier " } ,
{ Opt_nobarrier , " nobarrier " } ,
2008-01-29 07:58:27 +03:00
{ Opt_i_version , " i_version " } ,
2015-02-17 02:59:38 +03:00
{ Opt_dax , " dax " } ,
2008-01-29 08:19:52 +03:00
{ Opt_stripe , " stripe=%u " } ,
2008-07-12 03:27:31 +04:00
{ Opt_delalloc , " delalloc " } ,
2015-02-02 08:37:02 +03:00
{ Opt_lazytime , " lazytime " } ,
{ Opt_nolazytime , " nolazytime " } ,
2017-01-11 23:32:22 +03:00
{ Opt_debug_want_extra_isize , " debug_want_extra_isize=%u " } ,
2008-07-12 03:27:31 +04:00
{ Opt_nodelalloc , " nodelalloc " } ,
2013-01-28 18:30:52 +04:00
{ Opt_removed , " mblk_io_submit " } ,
{ Opt_removed , " nomblk_io_submit " } ,
2009-05-17 23:38:01 +04:00
{ Opt_block_validity , " block_validity " } ,
{ Opt_noblock_validity , " noblock_validity " } ,
2008-10-10 07:53:47 +04:00
{ Opt_inode_readahead_blks , " inode_readahead_blks=%u " } ,
2009-01-06 06:46:26 +03:00
{ Opt_journal_ioprio , " journal_ioprio=%u " } ,
2009-03-17 06:12:23 +03:00
{ Opt_auto_da_alloc , " auto_da_alloc=%u " } ,
2009-03-28 17:59:57 +03:00
{ Opt_auto_da_alloc , " auto_da_alloc " } ,
{ Opt_noauto_da_alloc , " noauto_da_alloc " } ,
2010-03-05 00:14:02 +03:00
{ Opt_dioread_nolock , " dioread_nolock " } ,
{ Opt_dioread_lock , " dioread_lock " } ,
2009-11-19 22:25:42 +03:00
{ Opt_discard , " discard " } ,
{ Opt_nodiscard , " nodiscard " } ,
2011-12-13 07:06:18 +04:00
{ Opt_init_itable , " init_itable=%u " } ,
{ Opt_init_itable , " init_itable " } ,
{ Opt_noinit_itable , " noinit_itable " } ,
2012-08-17 17:48:17 +04:00
{ Opt_max_dir_size_kb , " max_dir_size_kb=%u " } ,
2015-04-16 08:56:00 +03:00
{ Opt_test_dummy_encryption , " test_dummy_encryption " } ,
2017-06-22 18:55:14 +03:00
{ Opt_nombcache , " nombcache " } ,
{ Opt_nombcache , " no_mbcache " } , /* for backward compatibility */
2012-03-05 07:00:53 +04:00
{ Opt_removed , " check=none " } , /* mount option from ext2/3 */
{ Opt_removed , " nocheck " } , /* mount option from ext2/3 */
{ Opt_removed , " reservation " } , /* mount option from ext2/3 */
{ Opt_removed , " noreservation " } , /* mount option from ext2/3 */
{ Opt_removed , " journal=%u " } , /* mount option from ext2/3 */
2008-04-30 06:05:28 +04:00
{ Opt_err , NULL } ,
2006-10-11 12:20:50 +04:00
} ;
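The table above pairs each option token with a pattern that is either a bare keyword or a keyword followed by a `%u` or `%s` argument. The following is a minimal userspace sketch of that table-driven matching idea, not the kernel's `match_token()` implementation. The names `demo_tokens`, `demo_match`, and the `TOK_*` values are hypothetical, and only a single trailing `%u` or `%s` is supported.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical token values; the real file uses the Opt_* enum. */
enum { TOK_GRPID, TOK_COMMIT, TOK_JOURNAL_PATH, TOK_ERR };

struct demo_pattern {
	int token;
	const char *text;	/* keyword, optionally ending in %u or %s */
};

/* Several spellings may map to one token; a NULL pattern ends the table. */
static const struct demo_pattern demo_tokens[] = {
	{ TOK_GRPID, "grpid" },
	{ TOK_GRPID, "bsdgroups" },
	{ TOK_COMMIT, "commit=%u" },
	{ TOK_JOURNAL_PATH, "journal_path=%s" },
	{ TOK_ERR, NULL },
};

/*
 * Return the token matching opt, or TOK_ERR. On a match with an
 * argument, *arg points at the argument text inside opt; otherwise
 * *arg is NULL.
 */
static int demo_match(const char *opt, const char **arg)
{
	const struct demo_pattern *p;

	*arg = NULL;
	for (p = demo_tokens; p->text; p++) {
		const char *spec = strchr(p->text, '%');
		size_t prefix = spec ? (size_t)(spec - p->text) : strlen(p->text);

		if (strncmp(opt, p->text, prefix) != 0)
			continue;
		if (!spec) {
			/* Bare keyword: the whole option must match. */
			if (opt[prefix] == '\0')
				return p->token;
			continue;
		}
		if (spec[1] == 'u') {
			char *end;

			/* %u: require at least one digit and nothing after. */
			if (opt[prefix] < '0' || opt[prefix] > '9')
				continue;
			(void)strtoul(opt + prefix, &end, 10);
			if (*end != '\0')
				continue;
		}
		/* %s accepts whatever follows the prefix. */
		*arg = opt + prefix;
		return p->token;
	}
	return TOK_ERR;
}
```

Because the first matching row wins, table order matters. This is presumably why the real table lists `usrjquota=` ahead of `usrjquota=%s`.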
2006-10-11 12:20:53 +04:00
static ext4_fsblk_t get_sb_block ( void * * data )
2006-10-11 12:20:50 +04:00
{
2006-10-11 12:20:53 +04:00
ext4_fsblk_t sb_block ;
2006-10-11 12:20:50 +04:00
char * options = ( char * ) * data ;
if ( ! options | | strncmp ( options , " sb= " , 3 ) ! = 0 )
return 1 ; /* Default location */
2009-06-04 01:59:28 +04:00
2006-10-11 12:20:50 +04:00
options + = 3 ;
2009-06-04 01:59:28 +04:00
/* TODO: use simple_strtoll with >32bit ext4 */
2006-10-11 12:20:50 +04:00
sb_block = simple_strtoul ( options , & options , 0 ) ;
if ( * options & & * options ! = ' , ' ) {
2008-09-09 07:00:52 +04:00
printk ( KERN_ERR " EXT4-fs: Invalid sb specification: %s \n " ,
2006-10-11 12:20:50 +04:00
( char * ) * data ) ;
return 1 ;
}
if ( * options = = ' , ' )
options + + ;
* data = ( void * ) options ;
2009-06-04 01:59:28 +04:00
2006-10-11 12:20:50 +04:00
return sb_block ;
}
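The behaviour of `get_sb_block()` can be tried outside the kernel. The sketch below is a userspace analogue under stated assumptions: `demo_get_sb_block` is a hypothetical name, `strtoul` stands in for `simple_strtoul`, and the error message is omitted. It only consumes a leading `sb=` option and returns 1, the default location, in every other case.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical userspace analogue of get_sb_block(): if the option
 * string starts with "sb=N", return N and advance *options past it
 * (and past a following comma). Otherwise return 1 and leave
 * *options untouched.
 */
static unsigned long demo_get_sb_block(char **options)
{
	char *s = *options;
	char *end;
	unsigned long block;

	if (!s || strncmp(s, "sb=", 3) != 0)
		return 1;	/* default location */

	block = strtoul(s + 3, &end, 0);
	if (*end && *end != ',')
		return 1;	/* malformed number: fall back to default */
	if (*end == ',')
		end++;
	*options = end;
	return block;
}
```

As in the original, the `sb=` option is only recognised at the very start of the option string, which is why `handle_mount_opt()` can treat `Opt_sb` as already handled.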
2009-01-06 06:46:26 +03:00
# define DEFAULT_JOURNAL_IOPRIO (IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 3))
2017-04-30 06:47:50 +03:00
static const char deprecated_msg [ ] =
" Mount option \" %s \" will be removed by %s \n "
2010-03-02 06:29:21 +03:00
" Contact linux-ext4@vger.kernel.org if you think we should keep it. \n " ;
2009-01-06 06:46:26 +03:00
2010-03-02 07:28:41 +03:00
# ifdef CONFIG_QUOTA
static int set_qf_name ( struct super_block * sb , int qtype , substring_t * args )
{
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
char * qname ;
2013-01-25 08:24:58 +04:00
int ret = - 1 ;
2010-03-02 07:28:41 +03:00
if ( sb_any_quota_loaded ( sb ) & &
! sbi - > s_qf_names [ qtype ] ) {
ext4_msg ( sb , KERN_ERR ,
" Cannot change journaled "
" quota options when quota turned on " ) ;
2012-04-17 02:55:26 +04:00
return - 1 ;
2010-03-02 07:28:41 +03:00
}
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_quota ( sb ) ) {
2016-04-04 00:03:37 +03:00
ext4_msg ( sb , KERN_INFO , " Journaled quota options "
" ignored when QUOTA feature is enabled " ) ;
return 1 ;
2013-03-03 02:57:08 +04:00
}
2010-03-02 07:28:41 +03:00
qname = match_strdup ( args ) ;
if ( ! qname ) {
ext4_msg ( sb , KERN_ERR ,
" Not enough memory for storing quotafile name " ) ;
2012-04-17 02:55:26 +04:00
return - 1 ;
2010-03-02 07:28:41 +03:00
}
2013-01-25 08:24:58 +04:00
if ( sbi - > s_qf_names [ qtype ] ) {
if ( strcmp ( sbi - > s_qf_names [ qtype ] , qname ) = = 0 )
ret = 1 ;
else
ext4_msg ( sb , KERN_ERR ,
" %s quota file already specified " ,
QTYPE2NAME ( qtype ) ) ;
goto errout ;
2010-03-02 07:28:41 +03:00
}
2013-01-25 08:24:58 +04:00
if ( strchr ( qname , ' / ' ) ) {
2010-03-02 07:28:41 +03:00
ext4_msg ( sb , KERN_ERR ,
" quotafile must be on filesystem root " ) ;
2013-01-25 08:24:58 +04:00
goto errout ;
2010-03-02 07:28:41 +03:00
}
2013-01-25 08:24:58 +04:00
sbi - > s_qf_names [ qtype ] = qname ;
2010-12-16 04:26:48 +03:00
set_opt ( sb , QUOTA ) ;
2010-03-02 07:28:41 +03:00
return 1 ;
2013-01-25 08:24:58 +04:00
errout :
kfree ( qname ) ;
return ret ;
2010-03-02 07:28:41 +03:00
}
static int clear_qf_name ( struct super_block * sb , int qtype )
{
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
if ( sb_any_quota_loaded ( sb ) & &
sbi - > s_qf_names [ qtype ] ) {
ext4_msg ( sb , KERN_ERR , " Cannot change journaled quota options "
" when quota turned on " ) ;
2012-04-17 02:55:26 +04:00
return - 1 ;
2010-03-02 07:28:41 +03:00
}
2013-01-25 08:24:58 +04:00
kfree ( sbi - > s_qf_names [ qtype ] ) ;
2010-03-02 07:28:41 +03:00
sbi - > s_qf_names [ qtype ] = NULL ;
return 1 ;
}
# endif
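The return convention of `set_qf_name()` is easy to misread: setting the same name again succeeds, a different name fails, and a name containing `/` is rejected. The following userspace sketch illustrates only those three rules. `demo_qf_names` and `demo_set_qf_name` are hypothetical stand-ins, and the checks for loaded quotas and the QUOTA feature are left out.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for sbi->s_qf_names[]: 0 = user, 1 = group. */
static char *demo_qf_names[2];

/* Returns 1 on success (including an identical repeat), -1 on error. */
static int demo_set_qf_name(int qtype, const char *name)
{
	size_t len = strlen(name) + 1;
	char *copy = malloc(len);
	int ret = -1;

	if (!copy)
		return -1;
	memcpy(copy, name, len);

	if (demo_qf_names[qtype]) {
		/* Already set: only an identical name is accepted. */
		if (strcmp(demo_qf_names[qtype], copy) == 0)
			ret = 1;
		goto errout;
	}
	if (strchr(copy, '/'))
		goto errout;	/* must live in the filesystem root */

	demo_qf_names[qtype] = copy;	/* ownership moves to the table */
	return 1;
errout:
	free(copy);
	return ret;
}
```

The single `errout` label mirrors the original: the duplicated string is freed on every path that does not store it.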
2012-03-04 08:20:47 +04:00
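# define MOPT_SET 0x0001 /* set the mount_opt bits */
# define MOPT_CLEAR 0x0002 /* clear the mount_opt bits */
# define MOPT_NOSUPPORT 0x0004 /* option not supported in this build */
# define MOPT_EXPLICIT 0x0008 /* record that the option was given explicitly */
# define MOPT_CLEAR_ERR 0x0010 /* clear ERRORS_MASK before applying */
# define MOPT_GTE0 0x0020 /* numeric argument must be >= 0 */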
2006-10-11 12:20:50 +04:00
# ifdef CONFIG_QUOTA
2012-03-04 08:20:47 +04:00
# define MOPT_Q 0
# define MOPT_QFMT 0x0040
# else
# define MOPT_Q MOPT_NOSUPPORT
# define MOPT_QFMT MOPT_NOSUPPORT
2006-10-11 12:20:50 +04:00
# endif
2012-03-04 08:20:47 +04:00
# define MOPT_DATAJ 0x0080
2013-02-03 08:38:39 +04:00
# define MOPT_NO_EXT2 0x0100
# define MOPT_NO_EXT3 0x0200
# define MOPT_EXT4_ONLY (MOPT_NO_EXT2 | MOPT_NO_EXT3)
2013-08-29 03:05:07 +04:00
# define MOPT_STRING 0x0400
2012-03-04 08:20:47 +04:00
static const struct mount_opts {
int token ;
int mount_opt ;
int flags ;
} ext4_mount_opts [ ] = {
{ Opt_minix_df , EXT4_MOUNT_MINIX_DF , MOPT_SET } ,
{ Opt_bsd_df , EXT4_MOUNT_MINIX_DF , MOPT_CLEAR } ,
{ Opt_grpid , EXT4_MOUNT_GRPID , MOPT_SET } ,
{ Opt_nogrpid , EXT4_MOUNT_GRPID , MOPT_CLEAR } ,
{ Opt_block_validity , EXT4_MOUNT_BLOCK_VALIDITY , MOPT_SET } ,
{ Opt_noblock_validity , EXT4_MOUNT_BLOCK_VALIDITY , MOPT_CLEAR } ,
2013-02-03 08:38:39 +04:00
{ Opt_dioread_nolock , EXT4_MOUNT_DIOREAD_NOLOCK ,
MOPT_EXT4_ONLY | MOPT_SET } ,
{ Opt_dioread_lock , EXT4_MOUNT_DIOREAD_NOLOCK ,
MOPT_EXT4_ONLY | MOPT_CLEAR } ,
2012-03-04 08:20:47 +04:00
{ Opt_discard , EXT4_MOUNT_DISCARD , MOPT_SET } ,
{ Opt_nodiscard , EXT4_MOUNT_DISCARD , MOPT_CLEAR } ,
2013-02-03 08:38:39 +04:00
{ Opt_delalloc , EXT4_MOUNT_DELALLOC ,
MOPT_EXT4_ONLY | MOPT_SET | MOPT_EXPLICIT } ,
{ Opt_nodelalloc , EXT4_MOUNT_DELALLOC ,
2013-08-09 07:01:24 +04:00
MOPT_EXT4_ONLY | MOPT_CLEAR } ,
2014-11-26 00:20:50 +03:00
{ Opt_nojournal_checksum , EXT4_MOUNT_JOURNAL_CHECKSUM ,
MOPT_EXT4_ONLY | MOPT_CLEAR } ,
2013-02-03 08:38:39 +04:00
{ Opt_journal_checksum , EXT4_MOUNT_JOURNAL_CHECKSUM ,
ext4: do not allow journal_opts for fs w/o journal
It appears that we can pass journal-related mount options and that such options
are shown in /proc/mounts
Example:
#mkfs.ext4 -F /dev/vdb
#tune2fs -O ^has_journal /dev/vdb
#mount /dev/vdb /mnt/ -ocommit=20,journal_async_commit
#cat /proc/mounts | grep /mnt
/dev/vdb /mnt ext4 rw,relatime,journal_checksum,journal_async_commit,commit=20,data=ordered 0 0
But the options "journal_checksum,journal_async_commit,commit=20,data=ordered" have
nothing to do with reality because there is no journal at all.
This patch disallows the following options for journalless configurations:
- journal_checksum
- journal_async_commit
- commit=%ld
- data={writeback,ordered,journal}
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2015-10-19 06:50:26 +03:00
MOPT_EXT4_ONLY | MOPT_SET | MOPT_EXPLICIT } ,
2012-03-04 08:20:47 +04:00
{ Opt_journal_async_commit , ( EXT4_MOUNT_JOURNAL_ASYNC_COMMIT |
2013-02-03 08:38:39 +04:00
EXT4_MOUNT_JOURNAL_CHECKSUM ) ,
2015-10-19 06:50:26 +03:00
MOPT_EXT4_ONLY | MOPT_SET | MOPT_EXPLICIT } ,
2013-02-03 08:38:39 +04:00
{ Opt_noload , EXT4_MOUNT_NOLOAD , MOPT_NO_EXT2 | MOPT_SET } ,
2012-03-04 08:20:47 +04:00
{ Opt_err_panic , EXT4_MOUNT_ERRORS_PANIC , MOPT_SET | MOPT_CLEAR_ERR } ,
{ Opt_err_ro , EXT4_MOUNT_ERRORS_RO , MOPT_SET | MOPT_CLEAR_ERR } ,
{ Opt_err_cont , EXT4_MOUNT_ERRORS_CONT , MOPT_SET | MOPT_CLEAR_ERR } ,
2013-02-03 08:38:39 +04:00
{ Opt_data_err_abort , EXT4_MOUNT_DATA_ERR_ABORT ,
2016-03-13 05:55:50 +03:00
MOPT_NO_EXT2 } ,
2013-02-03 08:38:39 +04:00
{ Opt_data_err_ignore , EXT4_MOUNT_DATA_ERR_ABORT ,
2016-03-13 05:55:50 +03:00
MOPT_NO_EXT2 } ,
2012-03-04 08:20:47 +04:00
{ Opt_barrier , EXT4_MOUNT_BARRIER , MOPT_SET } ,
{ Opt_nobarrier , EXT4_MOUNT_BARRIER , MOPT_CLEAR } ,
{ Opt_noauto_da_alloc , EXT4_MOUNT_NO_AUTO_DA_ALLOC , MOPT_SET } ,
{ Opt_auto_da_alloc , EXT4_MOUNT_NO_AUTO_DA_ALLOC , MOPT_CLEAR } ,
{ Opt_noinit_itable , EXT4_MOUNT_INIT_INODE_TABLE , MOPT_CLEAR } ,
{ Opt_commit , 0 , MOPT_GTE0 } ,
{ Opt_max_batch_time , 0 , MOPT_GTE0 } ,
{ Opt_min_batch_time , 0 , MOPT_GTE0 } ,
{ Opt_inode_readahead_blks , 0 , MOPT_GTE0 } ,
{ Opt_init_itable , 0 , MOPT_GTE0 } ,
2015-02-17 02:59:38 +03:00
{ Opt_dax , EXT4_MOUNT_DAX , MOPT_SET } ,
2012-03-04 08:20:47 +04:00
{ Opt_stripe , 0 , MOPT_GTE0 } ,
2013-02-03 07:52:19 +04:00
{ Opt_resuid , 0 , MOPT_GTE0 } ,
{ Opt_resgid , 0 , MOPT_GTE0 } ,
2015-07-22 06:57:59 +03:00
{ Opt_journal_dev , 0 , MOPT_NO_EXT2 | MOPT_GTE0 } ,
{ Opt_journal_path , 0 , MOPT_NO_EXT2 | MOPT_STRING } ,
{ Opt_journal_ioprio , 0 , MOPT_NO_EXT2 | MOPT_GTE0 } ,
2013-02-03 08:38:39 +04:00
{ Opt_data_journal , EXT4_MOUNT_JOURNAL_DATA , MOPT_NO_EXT2 | MOPT_DATAJ } ,
{ Opt_data_ordered , EXT4_MOUNT_ORDERED_DATA , MOPT_NO_EXT2 | MOPT_DATAJ } ,
{ Opt_data_writeback , EXT4_MOUNT_WRITEBACK_DATA ,
MOPT_NO_EXT2 | MOPT_DATAJ } ,
2012-03-04 08:20:47 +04:00
{ Opt_user_xattr , EXT4_MOUNT_XATTR_USER , MOPT_SET } ,
{ Opt_nouser_xattr , EXT4_MOUNT_XATTR_USER , MOPT_CLEAR } ,
2008-10-11 04:02:48 +04:00
# ifdef CONFIG_EXT4_FS_POSIX_ACL
2012-03-04 08:20:47 +04:00
{ Opt_acl , EXT4_MOUNT_POSIX_ACL , MOPT_SET } ,
{ Opt_noacl , EXT4_MOUNT_POSIX_ACL , MOPT_CLEAR } ,
2006-10-11 12:20:50 +04:00
# else
2012-03-04 08:20:47 +04:00
{ Opt_acl , 0 , MOPT_NOSUPPORT } ,
{ Opt_noacl , 0 , MOPT_NOSUPPORT } ,
2006-10-11 12:20:50 +04:00
# endif
2012-03-04 08:20:47 +04:00
{ Opt_nouid32 , EXT4_MOUNT_NO_UID32 , MOPT_SET } ,
{ Opt_debug , EXT4_MOUNT_DEBUG , MOPT_SET } ,
2017-01-11 23:32:22 +03:00
{ Opt_debug_want_extra_isize , 0 , MOPT_GTE0 } ,
2012-03-04 08:20:47 +04:00
{ Opt_quota , EXT4_MOUNT_QUOTA | EXT4_MOUNT_USRQUOTA , MOPT_SET | MOPT_Q } ,
{ Opt_usrquota , EXT4_MOUNT_QUOTA | EXT4_MOUNT_USRQUOTA ,
MOPT_SET | MOPT_Q } ,
{ Opt_grpquota , EXT4_MOUNT_QUOTA | EXT4_MOUNT_GRPQUOTA ,
MOPT_SET | MOPT_Q } ,
2016-09-06 06:08:16 +03:00
{ Opt_prjquota , EXT4_MOUNT_QUOTA | EXT4_MOUNT_PRJQUOTA ,
MOPT_SET | MOPT_Q } ,
2012-03-04 08:20:47 +04:00
{ Opt_noquota , ( EXT4_MOUNT_QUOTA | EXT4_MOUNT_USRQUOTA |
2016-09-06 06:08:16 +03:00
EXT4_MOUNT_GRPQUOTA | EXT4_MOUNT_PRJQUOTA ) ,
MOPT_CLEAR | MOPT_Q } ,
2012-03-04 08:20:47 +04:00
{ Opt_usrjquota , 0 , MOPT_Q } ,
{ Opt_grpjquota , 0 , MOPT_Q } ,
{ Opt_offusrjquota , 0 , MOPT_Q } ,
{ Opt_offgrpjquota , 0 , MOPT_Q } ,
{ Opt_jqfmt_vfsold , QFMT_VFS_OLD , MOPT_QFMT } ,
{ Opt_jqfmt_vfsv0 , QFMT_VFS_V0 , MOPT_QFMT } ,
{ Opt_jqfmt_vfsv1 , QFMT_VFS_V1 , MOPT_QFMT } ,
2012-08-17 17:48:17 +04:00
{ Opt_max_dir_size_kb , 0 , MOPT_GTE0 } ,
2015-04-16 08:56:00 +03:00
{ Opt_test_dummy_encryption , 0 , MOPT_GTE0 } ,
2017-06-22 18:55:14 +03:00
{ Opt_nombcache , EXT4_MOUNT_NO_MBCACHE , MOPT_SET } ,
2012-03-04 08:20:47 +04:00
{ Opt_err , 0 , 0 }
} ;
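Most rows of `ext4_mount_opts[]` simply say "set these bits" or "clear these bits". The sketch below shows that table-driven pattern in isolation. The `DEMO_*` flag values, the `OPT_*` tokens, and `demo_apply` are hypothetical and do not match the real `EXT4_MOUNT_*` constants. The lookup and the set/clear decision follow the generic final branch of `handle_mount_opt()` in simplified form.

```c
#include <stddef.h>

#define DEMO_SET	0x0001
#define DEMO_CLEAR	0x0002

/* Hypothetical mount flag bits, not the real EXT4_MOUNT_* values. */
#define DEMO_MOUNT_GRPID	0x0004
#define DEMO_MOUNT_DISCARD	0x0400

enum { OPT_GRPID, OPT_NOGRPID, OPT_DISCARD, OPT_NODISCARD, OPT_ERR };

struct demo_opt {
	int token;
	unsigned int mount_opt;
	int flags;
};

/* A positive and a negative option share one bit and differ in flags. */
static const struct demo_opt demo_opts[] = {
	{ OPT_GRPID, DEMO_MOUNT_GRPID, DEMO_SET },
	{ OPT_NOGRPID, DEMO_MOUNT_GRPID, DEMO_CLEAR },
	{ OPT_DISCARD, DEMO_MOUNT_DISCARD, DEMO_SET },
	{ OPT_NODISCARD, DEMO_MOUNT_DISCARD, DEMO_CLEAR },
	{ OPT_ERR, 0, 0 }
};

/* Returns 1 if the token was applied to *mount_opt, -1 otherwise. */
static int demo_apply(unsigned int *mount_opt, int token)
{
	const struct demo_opt *m;

	for (m = demo_opts; m->token != OPT_ERR; m++)
		if (m->token == token)
			break;
	if (m->token == OPT_ERR)
		return -1;	/* unrecognized option */

	if (m->flags & DEMO_SET)
		*mount_opt |= m->mount_opt;
	else if (m->flags & DEMO_CLEAR)
		*mount_opt &= ~m->mount_opt;
	else
		return -1;	/* table row with neither flag */
	return 1;
}
```

Keeping the policy in a table means a new boolean option needs one or two rows instead of a new `case` in the parser.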
static int handle_mount_opt ( struct super_block * sb , char * opt , int token ,
substring_t * args , unsigned long * journal_devnum ,
unsigned int * journal_ioprio , int is_remount )
{
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
const struct mount_opts * m ;
2012-02-08 03:41:49 +04:00
kuid_t uid ;
kgid_t gid ;
2012-03-04 08:20:47 +04:00
int arg = 0 ;
2012-04-17 02:55:26 +04:00
# ifdef CONFIG_QUOTA
if ( token = = Opt_usrjquota )
return set_qf_name ( sb , USRQUOTA , & args [ 0 ] ) ;
else if ( token = = Opt_grpjquota )
return set_qf_name ( sb , GRPQUOTA , & args [ 0 ] ) ;
else if ( token = = Opt_offusrjquota )
return clear_qf_name ( sb , USRQUOTA ) ;
else if ( token = = Opt_offgrpjquota )
return clear_qf_name ( sb , GRPQUOTA ) ;
# endif
2012-03-04 08:20:47 +04:00
switch ( token ) {
2012-03-05 07:06:20 +04:00
case Opt_noacl :
case Opt_nouser_xattr :
ext4_msg ( sb , KERN_WARNING , deprecated_msg , opt , " 3.5 " ) ;
break ;
2012-03-04 08:20:47 +04:00
case Opt_sb :
return 1 ; /* handled by get_sb_block() */
case Opt_removed :
2013-02-03 08:09:36 +04:00
ext4_msg ( sb , KERN_WARNING , " Ignoring removed %s option " , opt ) ;
2012-03-04 08:20:47 +04:00
return 1 ;
case Opt_abort :
sbi - > s_mount_flags | = EXT4_MF_FS_ABORTED ;
return 1 ;
case Opt_i_version :
sb - > s_flags | = MS_I_VERSION ;
return 1 ;
2015-02-02 08:37:02 +03:00
case Opt_lazytime :
sb - > s_flags | = MS_LAZYTIME ;
return 1 ;
case Opt_nolazytime :
sb - > s_flags & = ~ MS_LAZYTIME ;
return 1 ;
2012-03-04 08:20:47 +04:00
}
2013-02-03 08:09:36 +04:00
for ( m = ext4_mount_opts ; m - > token ! = Opt_err ; m + + )
if ( token = = m - > token )
break ;
if ( m - > token = = Opt_err ) {
ext4_msg ( sb , KERN_ERR , " Unrecognized mount option \" %s \" "
" or missing value " , opt ) ;
return - 1 ;
}
2013-02-03 08:38:39 +04:00
if ( ( m - > flags & MOPT_NO_EXT2 ) & & IS_EXT2_SB ( sb ) ) {
ext4_msg ( sb , KERN_ERR ,
" Mount option \" %s \" incompatible with ext2 " , opt ) ;
return - 1 ;
}
if ( ( m - > flags & MOPT_NO_EXT3 ) & & IS_EXT3_SB ( sb ) ) {
ext4_msg ( sb , KERN_ERR ,
" Mount option \" %s \" incompatible with ext3 " , opt ) ;
return - 1 ;
}
2013-08-29 03:05:07 +04:00
if ( args - > from & & ! ( m - > flags & MOPT_STRING ) & & match_int ( args , & arg ) )
2013-02-03 08:09:36 +04:00
return - 1 ;
if ( args - > from & & ( m - > flags & MOPT_GTE0 ) & & ( arg < 0 ) )
return - 1 ;
2015-10-19 06:35:32 +03:00
if ( m - > flags & MOPT_EXPLICIT ) {
if ( m - > mount_opt & EXT4_MOUNT_DELALLOC ) {
set_opt2 ( sb , EXPLICIT_DELALLOC ) ;
2015-10-19 06:50:26 +03:00
} else if ( m - > mount_opt & EXT4_MOUNT_JOURNAL_CHECKSUM ) {
set_opt2 ( sb , EXPLICIT_JOURNAL_CHECKSUM ) ;
2015-10-19 06:35:32 +03:00
} else
return - 1 ;
}
2013-02-03 08:09:36 +04:00
if ( m - > flags & MOPT_CLEAR_ERR )
clear_opt ( sb , ERRORS_MASK ) ;
if ( token = = Opt_noquota & & sb_any_quota_loaded ( sb ) ) {
ext4_msg ( sb , KERN_ERR , " Cannot change quota "
" options when quota turned on " ) ;
return - 1 ;
}
if ( m - > flags & MOPT_NOSUPPORT ) {
ext4_msg ( sb , KERN_ERR , " %s option not supported " , opt ) ;
} else if ( token = = Opt_commit ) {
if ( arg = = 0 )
arg = JBD2_DEFAULT_MAX_COMMIT_AGE ;
sbi - > s_commit_interval = HZ * arg ;
2017-01-11 23:32:22 +03:00
} else if ( token = = Opt_debug_want_extra_isize ) {
sbi - > s_want_extra_isize = arg ;
2013-02-03 08:09:36 +04:00
} else if ( token = = Opt_max_batch_time ) {
sbi - > s_max_batch_time = arg ;
} else if ( token = = Opt_min_batch_time ) {
sbi - > s_min_batch_time = arg ;
} else if ( token = = Opt_inode_readahead_blks ) {
2013-02-03 08:14:31 +04:00
if ( arg & & ( arg > ( 1 < < 30 ) | | ! is_power_of_2 ( arg ) ) ) {
ext4_msg ( sb , KERN_ERR ,
" EXT4-fs: inode_readahead_blks must be "
" 0 or a power of 2 smaller than 2^31 " ) ;
2012-03-04 08:20:47 +04:00
return - 1 ;
2013-02-03 08:09:36 +04:00
}
sbi - > s_inode_readahead_blks = arg ;
} else if ( token = = Opt_init_itable ) {
set_opt ( sb , INIT_INODE_TABLE ) ;
if ( ! args - > from )
arg = EXT4_DEF_LI_WAIT_MULT ;
sbi - > s_li_wait_mult = arg ;
} else if ( token = = Opt_max_dir_size_kb ) {
sbi - > s_max_dir_size_kb = arg ;
} else if ( token = = Opt_stripe ) {
sbi - > s_stripe = arg ;
} else if ( token = = Opt_resuid ) {
uid = make_kuid ( current_user_ns ( ) , arg ) ;
if ( ! uid_valid ( uid ) ) {
ext4_msg ( sb , KERN_ERR , " Invalid uid value %d " , arg ) ;
2012-03-04 08:20:47 +04:00
return - 1 ;
}
2013-02-03 08:09:36 +04:00
sbi - > s_resuid = uid ;
} else if ( token = = Opt_resgid ) {
gid = make_kgid ( current_user_ns ( ) , arg ) ;
if ( ! gid_valid ( gid ) ) {
ext4_msg ( sb , KERN_ERR , " Invalid gid value %d " , arg ) ;
return - 1 ;
}
sbi - > s_resgid = gid ;
} else if ( token = = Opt_journal_dev ) {
if ( is_remount ) {
ext4_msg ( sb , KERN_ERR ,
" Cannot specify journal on remount " ) ;
return - 1 ;
}
* journal_devnum = arg ;
2013-08-29 03:05:07 +04:00
} else if ( token = = Opt_journal_path ) {
char * journal_path ;
struct inode * journal_inode ;
struct path path ;
int error ;
if ( is_remount ) {
ext4_msg ( sb , KERN_ERR ,
" Cannot specify journal on remount " ) ;
return - 1 ;
}
journal_path = match_strdup ( & args [ 0 ] ) ;
if ( ! journal_path ) {
ext4_msg ( sb , KERN_ERR , " error: could not dup "
" journal device string " ) ;
return - 1 ;
}
error = kern_path ( journal_path , LOOKUP_FOLLOW , & path ) ;
if ( error ) {
ext4_msg ( sb , KERN_ERR , " error: could not find "
" journal device path: error %d " , error ) ;
kfree ( journal_path ) ;
return - 1 ;
}
2015-03-18 01:25:59 +03:00
journal_inode = d_inode ( path . dentry ) ;
2013-08-29 03:05:07 +04:00
if ( ! S_ISBLK ( journal_inode - > i_mode ) ) {
ext4_msg ( sb , KERN_ERR , " error: journal path %s "
" is not a block device " , journal_path ) ;
path_put ( & path ) ;
kfree ( journal_path ) ;
return - 1 ;
}
* journal_devnum = new_encode_dev ( journal_inode - > i_rdev ) ;
path_put ( & path ) ;
kfree ( journal_path ) ;
2013-02-03 08:09:36 +04:00
} else if ( token = = Opt_journal_ioprio ) {
if ( arg > 7 ) {
ext4_msg ( sb , KERN_ERR , " Invalid journal IO priority "
" (must be 0-7) " ) ;
return - 1 ;
}
* journal_ioprio =
IOPRIO_PRIO_VALUE ( IOPRIO_CLASS_BE , arg ) ;
2015-04-16 08:56:00 +03:00
} else if ( token = = Opt_test_dummy_encryption ) {
# ifdef CONFIG_EXT4_FS_ENCRYPTION
sbi - > s_mount_flags | = EXT4_MF_TEST_DUMMY_ENCRYPTION ;
ext4_msg ( sb , KERN_WARNING ,
" Test dummy encryption mode enabled " ) ;
# else
ext4_msg ( sb , KERN_WARNING ,
" Test dummy encryption mount option ignored " ) ;
# endif
2013-02-03 08:09:36 +04:00
} else if ( m - > flags & MOPT_DATAJ ) {
if ( is_remount ) {
if ( ! sbi - > s_journal )
ext4_msg ( sb , KERN_WARNING , " Remounting file system with no journal so ignoring journalled data option " ) ;
else if ( test_opt ( sb , DATA_FLAGS ) ! = m - > mount_opt ) {
2013-02-03 07:52:19 +04:00
ext4_msg ( sb , KERN_ERR ,
2012-03-04 08:20:47 +04:00
" Cannot change data mode on remount " ) ;
return - 1 ;
2006-10-11 12:20:50 +04:00
}
2012-03-04 08:20:47 +04:00
} else {
2013-02-03 08:09:36 +04:00
clear_opt ( sb , DATA_FLAGS ) ;
sbi - > s_mount_opt | = m - > mount_opt ;
2006-10-11 12:20:50 +04:00
}
2013-02-03 08:09:36 +04:00
# ifdef CONFIG_QUOTA
} else if ( m - > flags & MOPT_QFMT ) {
if ( sb_any_quota_loaded ( sb ) & &
sbi - > s_jquota_fmt ! = m - > mount_opt ) {
ext4_msg ( sb , KERN_ERR , " Cannot change journaled "
" quota options when quota turned on " ) ;
return - 1 ;
}
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_quota ( sb ) ) {
2016-04-04 00:03:37 +03:00
ext4_msg ( sb , KERN_INFO ,
" Quota format mount options ignored "
2013-03-03 02:57:08 +04:00
" when QUOTA feature is enabled " ) ;
2016-04-04 00:03:37 +03:00
return 1 ;
2013-03-03 02:57:08 +04:00
}
2013-02-03 08:09:36 +04:00
sbi - > s_jquota_fmt = m - > mount_opt ;
2015-02-17 02:59:38 +03:00
# endif
} else if ( token = = Opt_dax ) {
2015-09-29 22:48:11 +03:00
# ifdef CONFIG_FS_DAX
ext4_msg ( sb , KERN_WARNING ,
" DAX enabled. Warning: EXPERIMENTAL, use at your own risk " ) ;
sbi - > s_mount_opt | = m - > mount_opt ;
# else
2015-02-17 02:59:38 +03:00
ext4_msg ( sb , KERN_INFO , " dax option not supported " ) ;
return - 1 ;
2013-02-03 08:09:36 +04:00
# endif
2016-03-13 05:55:50 +03:00
} else if ( token = = Opt_data_err_abort ) {
sbi - > s_mount_opt | = m - > mount_opt ;
} else if ( token = = Opt_data_err_ignore ) {
sbi - > s_mount_opt & = ~ m - > mount_opt ;
2013-02-03 08:09:36 +04:00
} else {
if ( ! args - > from )
arg = 1 ;
if ( m - > flags & MOPT_CLEAR )
arg = ! arg ;
else if ( unlikely ( ! ( m - > flags & MOPT_SET ) ) ) {
ext4_msg ( sb , KERN_WARNING ,
" buggy handling of option %s " , opt ) ;
WARN_ON ( 1 ) ;
return - 1 ;
}
if ( arg ! = 0 )
sbi - > s_mount_opt | = m - > mount_opt ;
else
sbi - > s_mount_opt & = ~ m - > mount_opt ;
2012-03-04 08:20:47 +04:00
}
2013-02-03 08:09:36 +04:00
return 1 ;
2012-03-04 08:20:47 +04:00
}
static int parse_options ( char * options , struct super_block * sb ,
unsigned long * journal_devnum ,
unsigned int * journal_ioprio ,
int is_remount )
{
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
char * p ;
substring_t args [ MAX_OPT_ARGS ] ;
int token ;
if ( ! options )
return 1 ;
while ( ( p = strsep ( & options , " , " ) ) ! = NULL ) {
if ( ! * p )
continue ;
/*
* Initialize args struct so we know whether arg was
* found ; some options take optional arguments .
*/
2012-08-19 06:29:18 +04:00
args [ 0 ] . to = args [ 0 ] . from = NULL ;
2012-03-04 08:20:47 +04:00
token = match_token ( p , tokens , args ) ;
if ( handle_mount_opt ( sb , p , token , args , journal_devnum ,
journal_ioprio , is_remount ) < 0 )
return 0 ;
2006-10-11 12:20:50 +04:00
}
# ifdef CONFIG_QUOTA
2016-09-06 06:08:16 +03:00
/*
* We do the test below only for project quotas . ' usrquota ' and
* ' grpquota ' mount options are allowed even without quota feature
* to support legacy quotas in quota files .
*/
if ( test_opt ( sb , PRJQUOTA ) & & ! ext4_has_feature_project ( sb ) ) {
ext4_msg ( sb , KERN_ERR , " Project quota feature not enabled. "
" Cannot enable project quota enforcement. " ) ;
return 0 ;
}
if ( sbi - > s_qf_names [ USRQUOTA ] | | sbi - > s_qf_names [ GRPQUOTA ] ) {
2010-02-24 19:35:32 +03:00
if ( test_opt ( sb , USRQUOTA ) & & sbi - > s_qf_names [ USRQUOTA ] )
2010-12-16 04:26:48 +03:00
clear_opt ( sb , USRQUOTA ) ;
2006-10-11 12:20:50 +04:00
2010-02-24 19:35:32 +03:00
if ( test_opt ( sb , GRPQUOTA ) & & sbi - > s_qf_names [ GRPQUOTA ] )
2010-12-16 04:26:48 +03:00
clear_opt ( sb , GRPQUOTA ) ;
2006-10-11 12:20:50 +04:00
2010-03-02 07:28:41 +03:00
if ( test_opt ( sb , GRPQUOTA ) | | test_opt ( sb , USRQUOTA ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " old and new quota "
" format mixing " ) ;
2006-10-11 12:20:50 +04:00
return 0 ;
}
if ( ! sbi - > s_jquota_fmt ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " journaled quota format "
" not specified " ) ;
2006-10-11 12:20:50 +04:00
return 0 ;
}
}
# endif
2012-12-20 09:07:18 +04:00
if ( test_opt ( sb , DIOREAD_NOLOCK ) ) {
int blocksize =
BLOCK_SIZE < < le32_to_cpu ( sbi - > s_es - > s_log_block_size ) ;
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 15:29:47 +03:00
if ( blocksize < PAGE_SIZE ) {
2012-12-20 09:07:18 +04:00
ext4_msg ( sb , KERN_ERR , " can't mount with "
" dioread_nolock if block size != PAGE_SIZE " ) ;
return 0 ;
}
}
2006-10-11 12:20:50 +04:00
return 1 ;
}
2012-03-04 08:20:50 +04:00
static inline void ext4_show_quota_options ( struct seq_file * seq ,
struct super_block * sb )
{
# if defined(CONFIG_QUOTA)
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
if ( sbi - > s_jquota_fmt ) {
char * fmtname = " " ;
switch ( sbi - > s_jquota_fmt ) {
case QFMT_VFS_OLD :
fmtname = " vfsold " ;
break ;
case QFMT_VFS_V0 :
fmtname = " vfsv0 " ;
break ;
case QFMT_VFS_V1 :
fmtname = " vfsv1 " ;
break ;
}
seq_printf ( seq , " ,jqfmt=%s " , fmtname ) ;
}
if ( sbi - > s_qf_names [ USRQUOTA ] )
fs: create and use seq_show_option for escaping
Many file systems that implement the show_options hook fail to correctly
escape their output which could lead to unescaped characters (e.g. new
lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This
could lead to confusion, spoofed entries (resulting in things like
systemd issuing false d-bus "mount" notifications), and who knows what
else. This looks like it would only be the root user stepping on
themselves, but it's possible weird things could happen in containers or
in other situations with delegated mount privileges.
Here's an example using overlay with setuid fusermount trusting the
contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use
of "sudo" is something more sneaky:
$ BASE="ovl"
$ MNT="$BASE/mnt"
$ LOW="$BASE/lower"
$ UP="$BASE/upper"
$ WORK="$BASE/work/ 0 0
none /proc fuse.pwn user_id=1000"
$ mkdir -p "$LOW" "$UP" "$WORK"
$ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
$ cat /proc/mounts
none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
none /proc fuse.pwn user_id=1000 0 0
$ fusermount -u /proc
$ cat /proc/mounts
cat: /proc/mounts: No such file or directory
This fixes the problem by adding new seq_show_option and
seq_show_option_n helpers, and updating the vulnerable show_option
handlers to use them as needed. Some, like SELinux, need to be open
coded due to unusual existing escape mechanisms.
[akpm@linux-foundation.org: add lost chunk, per Kees]
[keescook@chromium.org: seq_show_option should be using const parameters]
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Jan Kara <jack@suse.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Cc: J. R. Okajima <hooanon05g@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-05 01:44:57 +03:00
seq_show_option ( seq , " usrjquota " , sbi - > s_qf_names [ USRQUOTA ] ) ;
2012-03-04 08:20:50 +04:00
if ( sbi - > s_qf_names [ GRPQUOTA ] )
2015-09-05 01:44:57 +03:00
seq_show_option ( seq , " grpjquota " , sbi - > s_qf_names [ GRPQUOTA ] ) ;
2012-03-04 08:20:50 +04:00
# endif
}
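The escaping concern that motivates seq_show_option() in ext4_show_quota_options() above can be illustrated in user space. This is a minimal sketch, not the kernel helper: `escape_option_value` is a hypothetical name, and both the escape set (comma, equals sign, space, tab, newline, backslash) and the octal `\ooo` output form are assumptions chosen for illustration.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical user-space sketch: copy src into dst, replacing each
 * character from the assumed escape set with a backslash followed by
 * three octal digits. Returns 0 on success, -1 if dst is too small.
 */
static int escape_option_value(const char *src, char *dst, size_t dst_len)
{
	static const char esc[] = ",= \t\n\\";	/* assumed escape set */
	size_t used = 0;

	for (; *src; src++) {
		unsigned char c = (unsigned char)*src;

		if (strchr(esc, c)) {
			if (dst_len - used < 5)	/* "\ooo" plus NUL */
				return -1;
			snprintf(dst + used, 5, "\\%03o", (unsigned int)c);
			used += 4;
		} else {
			if (dst_len - used < 2)	/* one char plus NUL */
				return -1;
			dst[used++] = (char)c;
		}
	}
	if (used >= dst_len)
		return -1;
	dst[used] = '\0';
	return 0;
}
```

With this sketch, a quota file name containing a newline or a comma can no longer terminate the option early or start a forged line in a mounts-style listing, because those characters are emitted as octal escapes.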
2012-03-05 04:27:31 +04:00
static const char * token2str ( int token )
{
2012-09-24 06:49:12 +04:00
const struct match_token * t ;
2012-03-05 04:27:31 +04:00
for ( t = tokens ; t - > token ! = Opt_err ; t + + )
if ( t - > token = = token & & ! strchr ( t - > pattern , ' = ' ) )
break ;
return t - > pattern ;
}
2012-03-04 08:20:50 +04:00
/*
* Show an option if
* - it ' s set to a non - default value OR
* - if the per - sb default is different from the global default
*/
2012-03-05 05:21:38 +04:00
static int _ext4_show_options ( struct seq_file * seq , struct super_block * sb ,
int nodefs )
2012-03-04 08:20:50 +04:00
{
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
struct ext4_super_block * es = sbi - > s_es ;
2012-03-05 05:21:38 +04:00
int def_errors , def_mount_opt = nodefs ? 0 : sbi - > s_def_mount_opt ;
2012-03-05 04:27:31 +04:00
const struct mount_opts * m ;
2012-03-05 05:21:38 +04:00
char sep = nodefs ? ' \n ' : ' , ' ;
2012-03-04 08:20:50 +04:00
2012-03-05 05:21:38 +04:00
# define SEQ_OPTS_PUTS(str) seq_printf(seq, "%c" str, sep)
# define SEQ_OPTS_PRINT(str, arg) seq_printf(seq, "%c" str, sep, arg)
2012-03-04 08:20:50 +04:00
if ( sbi - > s_sb_block ! = 1 )
2012-03-05 04:27:31 +04:00
SEQ_OPTS_PRINT ( " sb=%llu " , sbi - > s_sb_block ) ;
for ( m = ext4_mount_opts ; m - > token ! = Opt_err ; m + + ) {
int want_set = m - > flags & MOPT_SET ;
if ( ( ( m - > flags & ( MOPT_SET | MOPT_CLEAR ) ) = = 0 ) | |
( m - > flags & MOPT_CLEAR_ERR ) )
continue ;
2012-03-05 05:21:38 +04:00
if ( ! ( m - > mount_opt & ( sbi - > s_mount_opt ^ def_mount_opt ) ) )
2012-03-05 04:27:31 +04:00
continue ; /* skip if same as the default */
if ( ( want_set & &
( sbi - > s_mount_opt & m - > mount_opt ) ! = m - > mount_opt ) | |
( ! want_set & & ( sbi - > s_mount_opt & m - > mount_opt ) ) )
continue ; /* select Opt_noFoo vs Opt_Foo */
SEQ_OPTS_PRINT ( " %s " , token2str ( m - > token ) ) ;
2012-03-04 08:20:50 +04:00
}
2012-03-05 04:27:31 +04:00
2012-02-08 03:41:49 +04:00
if ( nodefs | | ! uid_eq ( sbi - > s_resuid , make_kuid ( & init_user_ns , EXT4_DEF_RESUID ) ) | |
2012-03-05 04:27:31 +04:00
le16_to_cpu ( es - > s_def_resuid ) ! = EXT4_DEF_RESUID )
2012-02-08 03:41:49 +04:00
SEQ_OPTS_PRINT ( " resuid=%u " ,
from_kuid_munged ( & init_user_ns , sbi - > s_resuid ) ) ;
if ( nodefs | | ! gid_eq ( sbi - > s_resgid , make_kgid ( & init_user_ns , EXT4_DEF_RESGID ) ) | |
2012-03-05 04:27:31 +04:00
le16_to_cpu ( es - > s_def_resgid ) ! = EXT4_DEF_RESGID )
2012-02-08 03:41:49 +04:00
SEQ_OPTS_PRINT ( " resgid=%u " ,
from_kgid_munged ( & init_user_ns , sbi - > s_resgid ) ) ;
2012-03-05 05:21:38 +04:00
def_errors = nodefs ? - 1 : le16_to_cpu ( es - > s_errors ) ;
2012-03-05 04:27:31 +04:00
if ( test_opt ( sb , ERRORS_RO ) & & def_errors ! = EXT4_ERRORS_RO )
SEQ_OPTS_PUTS ( " errors=remount-ro " ) ;
2012-03-04 08:20:50 +04:00
if ( test_opt ( sb , ERRORS_CONT ) & & def_errors ! = EXT4_ERRORS_CONTINUE )
2012-03-05 04:27:31 +04:00
SEQ_OPTS_PUTS ( " errors=continue " ) ;
2012-03-04 08:20:50 +04:00
if ( test_opt ( sb , ERRORS_PANIC ) & & def_errors ! = EXT4_ERRORS_PANIC )
2012-03-05 04:27:31 +04:00
SEQ_OPTS_PUTS ( " errors=panic " ) ;
2012-03-05 05:21:38 +04:00
if ( nodefs | | sbi - > s_commit_interval ! = JBD2_DEFAULT_MAX_COMMIT_AGE * HZ )
2012-03-05 04:27:31 +04:00
SEQ_OPTS_PRINT ( " commit=%lu " , sbi - > s_commit_interval / HZ ) ;
2012-03-05 05:21:38 +04:00
if ( nodefs | | sbi - > s_min_batch_time ! = EXT4_DEF_MIN_BATCH_TIME )
2012-03-05 04:27:31 +04:00
SEQ_OPTS_PRINT ( " min_batch_time=%u " , sbi - > s_min_batch_time ) ;
2012-03-05 05:21:38 +04:00
if ( nodefs | | sbi - > s_max_batch_time ! = EXT4_DEF_MAX_BATCH_TIME )
2012-03-05 04:27:31 +04:00
SEQ_OPTS_PRINT ( " max_batch_time=%u " , sbi - > s_max_batch_time ) ;
2012-03-04 08:20:50 +04:00
if ( sb - > s_flags & MS_I_VERSION )
2012-03-05 04:27:31 +04:00
SEQ_OPTS_PUTS ( " i_version " ) ;
2012-03-05 05:21:38 +04:00
if ( nodefs | | sbi - > s_stripe )
2012-03-05 04:27:31 +04:00
SEQ_OPTS_PRINT ( " stripe=%lu " , sbi - > s_stripe ) ;
2012-03-05 05:21:38 +04:00
if ( EXT4_MOUNT_DATA_FLAGS & ( sbi - > s_mount_opt ^ def_mount_opt ) ) {
2012-03-05 04:27:31 +04:00
if ( test_opt ( sb , DATA_FLAGS ) = = EXT4_MOUNT_JOURNAL_DATA )
SEQ_OPTS_PUTS ( " data=journal " ) ;
else if ( test_opt ( sb , DATA_FLAGS ) = = EXT4_MOUNT_ORDERED_DATA )
SEQ_OPTS_PUTS ( " data=ordered " ) ;
else if ( test_opt ( sb , DATA_FLAGS ) = = EXT4_MOUNT_WRITEBACK_DATA )
SEQ_OPTS_PUTS ( " data=writeback " ) ;
}
2012-03-05 05:21:38 +04:00
if ( nodefs | |
sbi - > s_inode_readahead_blks ! = EXT4_DEF_INODE_READAHEAD_BLKS )
2012-03-05 04:27:31 +04:00
SEQ_OPTS_PRINT ( " inode_readahead_blks=%u " ,
sbi - > s_inode_readahead_blks ) ;
2012-03-04 08:20:50 +04:00
2012-03-05 05:21:38 +04:00
if ( nodefs | | ( test_opt ( sb , INIT_INODE_TABLE ) & &
( sbi - > s_li_wait_mult ! = EXT4_DEF_LI_WAIT_MULT ) ) )
2012-03-05 04:27:31 +04:00
SEQ_OPTS_PRINT ( " init_itable=%u " , sbi - > s_li_wait_mult ) ;
2012-08-17 17:48:17 +04:00
if ( nodefs | | sbi - > s_max_dir_size_kb )
SEQ_OPTS_PRINT ( " max_dir_size_kb=%u " , sbi - > s_max_dir_size_kb ) ;
2016-03-13 05:55:50 +03:00
if ( test_opt ( sb , DATA_ERR_ABORT ) )
SEQ_OPTS_PUTS ( " data_err=abort " ) ;
2012-03-04 08:20:50 +04:00
ext4_show_quota_options ( seq , sb ) ;
return 0 ;
}
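The "skip if same as the default" test in _ext4_show_options() above relies on XOR to find bits whose current state differs from the default. A minimal stand-alone sketch of that idea follows; `opt_differs_from_default` and its parameter names are hypothetical and do not exist in ext4.

```c
#include <stdbool.h>

/*
 * Hypothetical helper mirroring the XOR test above: an option bit is
 * worth printing only if its current state differs from the default.
 * cur and def are bitmasks of mount options; opt_mask selects the
 * bit(s) belonging to one option.
 */
static bool opt_differs_from_default(unsigned int cur, unsigned int def,
				     unsigned int opt_mask)
{
	return (opt_mask & (cur ^ def)) != 0;
}
```

Passing 0 as the default mask, as the nodefs path does with def_mount_opt, makes every set option bit count as "different", so all of them are listed.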
2012-03-05 05:21:38 +04:00
static int ext4_show_options ( struct seq_file * seq , struct dentry * root )
{
return _ext4_show_options ( seq , root - > d_sb , 0 ) ;
}
2015-09-23 19:46:17 +03:00
int ext4_seq_options_show ( struct seq_file * seq , void * offset )
2012-03-05 05:21:38 +04:00
{
struct super_block * sb = seq - > private ;
int rc ;
seq_puts ( seq , ( sb - > s_flags & MS_RDONLY ) ? " ro " : " rw " ) ;
rc = _ext4_show_options ( seq , sb , 1 ) ;
seq_puts ( seq , " \n " ) ;
return rc ;
}
2006-10-11 12:20:53 +04:00
static int ext4_setup_super ( struct super_block * sb , struct ext4_super_block * es ,
2006-10-11 12:20:50 +04:00
int read_only )
{
2006-10-11 12:20:53 +04:00
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
2006-10-11 12:20:50 +04:00
int res = 0 ;
2006-10-11 12:20:53 +04:00
if ( le32_to_cpu ( es - > s_rev_level ) > EXT4_MAX_SUPP_REV ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " revision level too high, "
" forcing read-only mode " ) ;
2006-10-11 12:20:50 +04:00
res = MS_RDONLY ;
}
if ( read_only )
2011-09-10 02:34:51 +04:00
goto done ;
2006-10-11 12:20:53 +04:00
if ( ! ( sbi - > s_mount_state & EXT4_VALID_FS ) )
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_WARNING , " warning: mounting unchecked fs, "
" running e2fsck is recommended " ) ;
2014-05-12 20:55:07 +04:00
else if ( sbi - > s_mount_state & EXT4_ERROR_FS )
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_WARNING ,
" warning: mounting fs with errors, "
" running e2fsck is recommended " ) ;
2011-05-18 21:29:57 +04:00
else if ( ( __s16 ) le16_to_cpu ( es - > s_max_mnt_count ) > 0 & &
2006-10-11 12:20:50 +04:00
le16_to_cpu ( es - > s_mnt_count ) > =
( unsigned short ) ( __s16 ) le16_to_cpu ( es - > s_max_mnt_count ) )
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_WARNING ,
" warning: maximal mount count reached, "
" running e2fsck is recommended " ) ;
2006-10-11 12:20:50 +04:00
else if ( le32_to_cpu ( es - > s_checkinterval ) & &
( le32_to_cpu ( es - > s_lastcheck ) +
le32_to_cpu ( es - > s_checkinterval ) < = get_seconds ( ) ) )
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_WARNING ,
" warning: checktime reached, "
" running e2fsck is recommended " ) ;
2009-06-04 01:59:28 +04:00
if ( ! sbi - > s_journal )
2009-01-07 08:06:22 +03:00
es - > s_state & = cpu_to_le16 ( ~ EXT4_VALID_FS ) ;
2006-10-11 12:20:50 +04:00
if ( ! ( __s16 ) le16_to_cpu ( es - > s_max_mnt_count ) )
2006-10-11 12:20:53 +04:00
es - > s_max_mnt_count = cpu_to_le16 ( EXT4_DFL_MAX_MNT_COUNT ) ;
2008-04-17 18:38:59 +04:00
le16_add_cpu ( & es - > s_mnt_count , 1 ) ;
2006-10-11 12:20:50 +04:00
es - > s_mtime = cpu_to_le32 ( get_seconds ( ) ) ;
2006-10-11 12:20:53 +04:00
ext4_update_dynamic_rev ( sb ) ;
2009-01-07 08:06:22 +03:00
if ( sbi - > s_journal )
2015-10-17 23:18:43 +03:00
ext4_set_feature_journal_needs_recovery ( sb ) ;
2006-10-11 12:20:50 +04:00
2009-05-01 08:33:44 +04:00
ext4_commit_super ( sb , 1 ) ;
2011-09-10 02:34:51 +04:00
done :
2006-10-11 12:20:50 +04:00
if ( test_opt ( sb , DEBUG ) )
2009-01-06 06:18:16 +03:00
printk ( KERN_INFO " [EXT4 FS bs=%lu, gc=%u, "
2010-12-16 04:30:48 +03:00
" bpg=%lu, ipg=%lu, mo=%04x, mo2=%04x] \n " ,
2006-10-11 12:20:50 +04:00
sb - > s_blocksize ,
sbi - > s_groups_count ,
2006-10-11 12:20:53 +04:00
EXT4_BLOCKS_PER_GROUP ( sb ) ,
EXT4_INODES_PER_GROUP ( sb ) ,
2010-12-16 04:30:48 +03:00
sbi - > s_mount_opt , sbi - > s_mount_opt2 ) ;
2006-10-11 12:20:50 +04:00
2011-05-26 20:02:03 +04:00
cleancache_init_fs ( sb ) ;
2006-10-11 12:20:50 +04:00
return res ;
}
2012-09-05 09:29:50 +04:00
int ext4_alloc_flex_bg_array ( struct super_block * sb , ext4_group_t ngroup )
{
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
struct flex_groups * new_groups ;
int size ;
if ( ! sbi - > s_log_groups_per_flex )
return 0 ;
size = ext4_flex_group ( sbi , ngroup - 1 ) + 1 ;
if ( size < = sbi - > s_flex_groups_allocated )
return 0 ;
size = roundup_pow_of_two ( size * sizeof ( struct flex_groups ) ) ;
2017-05-09 01:57:09 +03:00
new_groups = kvzalloc ( size , GFP_KERNEL ) ;
2012-09-05 09:29:50 +04:00
if ( ! new_groups ) {
ext4_msg ( sb , KERN_ERR , " not enough memory for %d flex groups " ,
size / ( int ) sizeof ( struct flex_groups ) ) ;
return - ENOMEM ;
}
if ( sbi - > s_flex_groups ) {
memcpy ( new_groups , sbi - > s_flex_groups ,
( sbi - > s_flex_groups_allocated *
sizeof ( struct flex_groups ) ) ) ;
2014-11-20 20:19:11 +03:00
kvfree ( sbi - > s_flex_groups ) ;
2012-09-05 09:29:50 +04:00
}
sbi - > s_flex_groups = new_groups ;
sbi - > s_flex_groups_allocated = size / sizeof ( struct flex_groups ) ;
return 0 ;
}
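ext4_alloc_flex_bg_array() above rounds the requested allocation size up to a power of two before allocating. The rounding step can be sketched in user space as follows; `roundup_pow2` is a hypothetical stand-in, not the kernel's roundup_pow_of_two(), and its handling of 0 and of overflow is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch: smallest power of two that is >= n.
 * Returns 1 for n == 0 and 0 if the result would not fit in size_t.
 */
static size_t roundup_pow2(size_t n)
{
	size_t p = 1;

	if (n > (SIZE_MAX >> 1) + 1)
		return 0;	/* would overflow */
	while (p < n)
		p <<= 1;
	return p;
}
```

One likely effect of rounding up is that the array has spare capacity, so a later call with a slightly larger group count can return early from the `size <= s_flex_groups_allocated` check instead of reallocating.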
2008-07-12 03:27:31 +04:00
static int ext4_fill_flex_info ( struct super_block * sb )
{
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
struct ext4_group_desc * gdp = NULL ;
ext4_group_t flex_group ;
2012-09-05 09:29:50 +04:00
int i , err ;
2008-07-12 03:27:31 +04:00
2009-11-23 15:24:46 +03:00
sbi - > s_log_groups_per_flex = sbi - > s_es - > s_log_groups_per_flex ;
ext4: fix undefined behavior in ext4_fill_flex_info()
Commit 503358ae01b70ce6909d19dd01287093f6b6271c ("ext4: avoid divide by
zero when trying to mount a corrupted file system") fixes CVE-2009-4307
by performing a sanity check on s_log_groups_per_flex, since it can be
set to a bogus value by an attacker.
sbi->s_log_groups_per_flex = sbi->s_es->s_log_groups_per_flex;
groups_per_flex = 1 << sbi->s_log_groups_per_flex;
if (groups_per_flex < 2) { ... }
This patch fixes two potential issues in the previous commit.
1) The sanity check might only work on architectures like PowerPC.
On x86, 5 bits are used for the shifting amount. That means, given a
large s_log_groups_per_flex value like 36, groups_per_flex = 1 << 36
is essentially 1 << 4 = 16, rather than 0. This will bypass the check,
leaving s_log_groups_per_flex and groups_per_flex inconsistent.
2) The sanity check relies on undefined behavior, i.e., oversized shift.
A standard-conforming C compiler could rewrite the check in unexpected
ways. Consider the following equivalent form, assuming groups_per_flex
is unsigned for simplicity.
groups_per_flex = 1 << sbi->s_log_groups_per_flex;
if (groups_per_flex == 0 || groups_per_flex == 1) {
We compile the code snippet using Clang 3.0 and GCC 4.6. Clang will
completely optimize away the check groups_per_flex == 0, leaving the
patched code as vulnerable as the original. GCC keeps the check, but
there is no guarantee that future versions will do the same.
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-01-10 20:51:10 +04:00
if ( sbi - > s_log_groups_per_flex < 1 | | sbi - > s_log_groups_per_flex > 31 ) {
2008-07-12 03:27:31 +04:00
sbi - > s_log_groups_per_flex = 0 ;
return 1 ;
}
2012-09-05 09:29:50 +04:00
err = ext4_alloc_flex_bg_array ( sb , sbi - > s_groups_count ) ;
if ( err )
2011-08-01 16:45:02 +04:00
goto failed ;
2008-07-12 03:27:31 +04:00
for ( i = 0 ; i < sbi - > s_groups_count ; i + + ) {
2009-05-25 19:50:39 +04:00
gdp = ext4_get_group_desc ( sb , i , NULL ) ;
2008-07-12 03:27:31 +04:00
flex_group = ext4_flex_group ( sbi , i ) ;
2009-09-12 00:51:28 +04:00
atomic_add ( ext4_free_inodes_count ( sb , gdp ) ,
& sbi - > s_flex_groups [ flex_group ] . free_inodes ) ;
2013-03-12 07:39:59 +04:00
atomic64_add ( ext4_free_group_clusters ( sb , gdp ) ,
& sbi - > s_flex_groups [ flex_group ] . free_clusters ) ;
2009-09-12 00:51:28 +04:00
atomic_add ( ext4_used_dirs_count ( sb , gdp ) ,
& sbi - > s_flex_groups [ flex_group ] . used_dirs ) ;
2008-07-12 03:27:31 +04:00
}
return 1 ;
failed :
return 0 ;
}
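The range check in ext4_fill_flex_info() above validates the on-disk log2 value before any shift is performed, so the shift count can never reach the width of the type. A minimal user-space sketch of the same pattern follows; `groups_per_flex` is a hypothetical helper, and returning 0 to signal "flex groups disabled" is an assumption of this sketch.

```c
/*
 * Hypothetical sketch: convert an untrusted log2 value into a group
 * count. The value is range-checked first, so the shift count is
 * always in 1..31 and the shift is well defined for a 32-bit
 * unsigned int. Returns 0 when the value is out of range.
 */
static unsigned int groups_per_flex(unsigned int log_groups)
{
	if (log_groups < 1 || log_groups > 31)
		return 0;
	return 1U << log_groups;
}
```

Checking the exponent rather than the shifted result avoids depending on what an oversized shift happens to produce on a given architecture or compiler.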
2015-10-17 23:18:43 +03:00
static __le16 ext4_group_desc_csum ( struct super_block * sb , __u32 block_group ,
2012-04-30 02:45:10 +04:00
struct ext4_group_desc * gdp )
Ext4: Uninitialized Block Groups
In pass1 of e2fsck, every inode table in the filesystem is scanned and checked,
regardless of whether it is in use. This is the most time consuming part
of the filesystem check. The uninitialized block group feature can greatly
reduce e2fsck time by eliminating checking of uninitialized inodes.
With this feature, there is a high water mark of used inodes for each block
group. Block and inode bitmaps can be uninitialized on disk via a flag in the
group descriptor to avoid reading or scanning them at e2fsck time. A checksum
of each group descriptor is used to ensure that corruption in the group
descriptor's bit flags does not cause incorrect operation.
The feature is enabled through a mkfs option
mke2fs /dev/ -O uninit_groups
A patch adding support for uninitialized block groups to e2fsprogs tools has
been posted to the linux-ext4 mailing list.
The patches have been stress tested with fsstress and fsx. In performance
tests testing e2fsck time, we have seen that e2fsck time on ext3 grows
linearly with the total number of inodes in the filesystem. In ext4 with the
uninitialized block groups feature, the e2fsck time is constant, based
solely on the number of used inodes rather than the total inode count.
Since typical ext4 filesystems only use 1-10% of their inodes, this feature can
greatly reduce e2fsck time for users, with a performance improvement of 2-20
times, depending on how full the filesystem is.
The attached graph shows the major improvements in e2fsck times in filesystems
with a large total inode count, but few inodes in use.
In each group descriptor if we have
EXT4_BG_INODE_UNINIT set in bg_flags:
Inode table is not initialized/used in this group. So we can skip
the consistency check during fsck.
EXT4_BG_BLOCK_UNINIT set in bg_flags:
No block in the group is used. So we can skip the block bitmap
verification for this group.
We also add two new fields to group descriptor as a part of
uninitialized group patch.
__le16 bg_itable_unused; /* Unused inodes count */
__le16 bg_checksum; /* crc16(sb_uuid+group+desc) */
bg_itable_unused:
If we have EXT4_BG_INODE_UNINIT not set in bg_flags
then bg_itable_unused will give the offset within
the inode table till the inodes are used. This can be
used by fsck to skip list of inodes that are marked unused.
bg_checksum:
Now that we depend on bg_flags and bg_itable_unused to determine
the block and inode usage, we need to make sure group descriptor
is not corrupt. We add checksum to group descriptor to
detect corruption. If the descriptor is found to be corrupt, we
mark all the blocks and inodes in the group used.
Signed-off-by: Avantika Mathur <mathur@us.ibm.com>
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2007-10-17 02:38:25 +04:00
{
2016-07-04 00:51:39 +03:00
int offset = offsetof ( struct ext4_group_desc , bg_checksum ) ;
2007-10-17 02:38:25 +04:00
__u16 crc = 0 ;
2012-04-30 02:45:10 +04:00
__le32 le_group = cpu_to_le32 ( block_group ) ;
2015-10-17 23:18:43 +03:00
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
Ext4: Uninitialized Block Groups
In pass1 of e2fsck, every inode table in the fileystem is scanned and checked,
regardless of whether it is in use. This is this the most time consuming part
of the filesystem check. The unintialized block group feature can greatly
reduce e2fsck time by eliminating checking of uninitialized inodes.
With this feature, there is a a high water mark of used inodes for each block
group. Block and inode bitmaps can be uninitialized on disk via a flag in the
group descriptor to avoid reading or scanning them at e2fsck time. A checksum
of each group descriptor is used to ensure that corruption in the group
descriptor's bit flags does not cause incorrect operation.
The feature is enabled through a mkfs option
mke2fs /dev/ -O uninit_groups
A patch adding support for uninitialized block groups to e2fsprogs tools has
been posted to the linux-ext4 mailing list.
The patches have been stress tested with fsstress and fsx. In performance
tests testing e2fsck time, we have seen that e2fsck time on ext3 grows
linearly with the total number of inodes in the filesytem. In ext4 with the
uninitialized block groups feature, the e2fsck time is constant, based
solely on the number of used inodes rather than the total inode count.
Since typical ext4 filesystems only use 1-10% of their inodes, this feature can
greatly reduce e2fsck time for users. With performance improvement of 2-20
times, depending on how full the filesystem is.
The attached graph shows the major improvements in e2fsck times in filesystems
with a large total inode count, but few inodes in use.
In each group descriptor if we have
EXT4_BG_INODE_UNINIT set in bg_flags:
Inode table is not initialized/used in this group. So we can skip
the consistency check during fsck.
EXT4_BG_BLOCK_UNINIT set in bg_flags:
No block in the group is used. So we can skip the block bitmap
verification for this group.
We also add two new fields to group descriptor as a part of
uninitialized group patch.
__le16 bg_itable_unused; /* Unused inodes count */
__le16 bg_checksum; /* crc16(sb_uuid+group+desc) */
bg_itable_unused:
If EXT4_BG_INODE_UNINIT is not set in bg_flags,
then bg_itable_unused gives the offset within
the inode table up to which the inodes are used. fsck can
use this to skip the inodes that are marked unused.
bg_checksum:
Now that we depend on bg_flags and bg_itable_unused to determine
the block and inode usage, we need to make sure the group descriptor
is not corrupt. We add a checksum to the group descriptor to
detect corruption. If the descriptor is found to be corrupt, we
mark all the blocks and inodes in the group as used.
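The way a checker might act on these fields can be sketched as follows. This is a minimal userspace illustration, not e2fsck code: `struct gd_view`, `inodes_to_scan()` and the `csum_ok` parameter are hypothetical names, the flag value mirrors the kernel's EXT4_BG_INODE_UNINIT (0x0001), and bg_itable_unused is assumed to be a count of unused inodes at the end of the table, as the field comment says.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BG_INODE_UNINIT 0x0001 /* same value as EXT4_BG_INODE_UNINIT */

/* Hypothetical CPU-endian view of the descriptor fields used here. */
struct gd_view {
	uint16_t bg_flags;
	uint16_t bg_itable_unused;
};

/*
 * Return how many inodes at the start of the group's inode table a
 * checker would have to scan. csum_ok tells whether bg_checksum matched.
 */
static uint32_t inodes_to_scan(const struct gd_view *gd,
			       uint32_t inodes_per_group, int csum_ok)
{
	assert(gd != NULL);

	/* Corrupt descriptor: trust nothing, treat every inode as used. */
	if (!csum_ok)
		return inodes_per_group;
	/* Whole inode table unused: nothing to check in this group. */
	if (gd->bg_flags & BG_INODE_UNINIT)
		return 0;
	/* A count larger than the table is nonsense: scan everything. */
	if (gd->bg_itable_unused > inodes_per_group)
		return inodes_per_group;
	return inodes_per_group - gd->bg_itable_unused;
}
```

With 8192 inodes per group and bg_itable_unused equal to 7000, only the first 1192 inodes would be scanned under these assumptions.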
Signed-off-by: Avantika Mathur <mathur@us.ibm.com>
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2007-10-17 02:38:25 +04:00
2014-10-13 11:36:16 +04:00
	if (ext4_has_metadata_csum(sbi->s_sb)) {
2012-04-30 02:45:10 +04:00
		/* Use new metadata_csum algorithm */
		__u32 csum32;
2016-07-04 00:51:39 +03:00
		__u16 dummy_csum = 0;
2012-04-30 02:45:10 +04:00
		csum32 = ext4_chksum(sbi, sbi->s_csum_seed, (__u8 *)&le_group,
				     sizeof(le_group));
2016-07-04 00:51:39 +03:00
		csum32 = ext4_chksum(sbi, csum32, (__u8 *)gdp, offset);
		csum32 = ext4_chksum(sbi, csum32, (__u8 *)&dummy_csum,
				     sizeof(dummy_csum));
		offset += sizeof(dummy_csum);
		if (offset < sbi->s_desc_size)
			csum32 = ext4_chksum(sbi, csum32, (__u8 *)gdp + offset,
					     sbi->s_desc_size - offset);
2012-04-30 02:45:10 +04:00
		crc = csum32 & 0xFFFF;
		goto out;
2007-10-17 02:38:25 +04:00
	}
2012-04-30 02:45:10 +04:00
	/* old crc16 code */
2015-10-17 23:18:43 +03:00
	if (!ext4_has_feature_gdt_csum(sb))
2014-10-14 10:35:49 +04:00
		return 0;
2012-04-30 02:45:10 +04:00
	crc = crc16(~0, sbi->s_es->s_uuid, sizeof(sbi->s_es->s_uuid));
	crc = crc16(crc, (__u8 *)&le_group, sizeof(le_group));
	crc = crc16(crc, (__u8 *)gdp, offset);
	offset += sizeof(gdp->bg_checksum); /* skip checksum */
	/* for checksum of struct ext4_group_desc do the rest...*/
2015-10-17 23:18:43 +03:00
	if (ext4_has_feature_64bit(sb) &&
2012-04-30 02:45:10 +04:00
	    offset < le16_to_cpu(sbi->s_es->s_desc_size))
		crc = crc16(crc, (__u8 *)gdp + offset,
			    le16_to_cpu(sbi->s_es->s_desc_size) -
				offset);
out:
2007-10-17 02:38:25 +04:00
	return cpu_to_le16(crc);
}
2012-04-30 02:45:10 +04:00
int ext4_group_desc_csum_verify(struct super_block *sb, __u32 block_group,
2007-10-17 02:38:25 +04:00
				struct ext4_group_desc *gdp)
{
2012-04-30 02:45:10 +04:00
	if (ext4_has_group_desc_csum(sb) &&
2015-10-17 23:18:43 +03:00
	    (gdp->bg_checksum != ext4_group_desc_csum(sb, block_group, gdp)))
2007-10-17 02:38:25 +04:00
		return 0;
	return 1;
}
2012-04-30 02:45:10 +04:00
void ext4_group_desc_csum_set(struct super_block *sb, __u32 block_group,
			      struct ext4_group_desc *gdp)
{
	if (!ext4_has_group_desc_csum(sb))
		return;
2015-10-17 23:18:43 +03:00
	gdp->bg_checksum = ext4_group_desc_csum(sb, block_group, gdp);
2012-04-30 02:45:10 +04:00
}
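The legacy crc16 path above (UUID, then group number, then the descriptor up to the checksum field) can be tried out in userspace with a sketch like the one below. It rests on several assumptions: a 32-byte descriptor with bg_checksum at byte offset 0x1E, so nothing follows the checksum; a bitwise CRC-16 (reflected polynomial 0xA001) that is believed to produce the same values as the kernel's table-driven crc16(); and hypothetical helper names `gd_csum()`, `gd_csum_set()` and `gd_csum_verify()` that operate on raw little-endian bytes rather than on struct ext4_group_desc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DESC_SIZE   32   /* descriptor size without the 64bit feature */
#define CSUM_OFFSET 0x1E /* assumed byte offset of bg_checksum */

/* Bitwise CRC-16, reflected polynomial 0xA001, caller-supplied seed. */
static uint16_t crc16(uint16_t crc, const uint8_t *buf, size_t len)
{
	while (len--) {
		crc ^= *buf++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc & 1) ? (crc >> 1) ^ 0xA001 : crc >> 1;
	}
	return crc;
}

/* Checksum over uuid + little-endian group number + descriptor prefix. */
static uint16_t gd_csum(const uint8_t uuid[16], uint32_t group,
			const uint8_t desc[DESC_SIZE])
{
	const uint8_t le_group[4] = {
		(uint8_t)group, (uint8_t)(group >> 8),
		(uint8_t)(group >> 16), (uint8_t)(group >> 24),
	};
	uint16_t crc;

	/* In this 32-byte layout nothing follows the checksum field. */
	assert(CSUM_OFFSET + sizeof(uint16_t) == DESC_SIZE);

	crc = crc16(0xFFFF, uuid, 16);
	crc = crc16(crc, le_group, sizeof(le_group));
	crc = crc16(crc, desc, CSUM_OFFSET);
	return crc;
}

/* Store the checksum in the descriptor as a little-endian 16-bit value. */
static void gd_csum_set(const uint8_t uuid[16], uint32_t group,
			uint8_t desc[DESC_SIZE])
{
	uint16_t crc = gd_csum(uuid, group, desc);

	desc[CSUM_OFFSET] = (uint8_t)crc;
	desc[CSUM_OFFSET + 1] = (uint8_t)(crc >> 8);
}

/* Return 1 if the stored checksum matches, 0 otherwise. */
static int gd_csum_verify(const uint8_t uuid[16], uint32_t group,
			  const uint8_t desc[DESC_SIZE])
{
	uint16_t stored = (uint16_t)(desc[CSUM_OFFSET] |
				     (desc[CSUM_OFFSET + 1] << 8));

	return stored == gd_csum(uuid, group, desc);
}
```

Because the group number is mixed into the checksum, a descriptor copied to the wrong slot fails verification just as a flipped flag bit does.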
2006-10-11 12:20:50 +04:00
/* Called at mount-time, super-block is locked */
ext4: add support for lazy inode table initialization
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out. The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, have not been corrupted. However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the old, uninitialized data could potentially cause e2fsck to
report false problems.
Hence, it is important for the inode tables to be initialized as soon
as possible. This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.
This is done via a new kernel thread called ext4lazyinit, which is
created on demand and destroyed when it is no longer needed. There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with the init_itable mount option is mounted, the
ext4lazyinit thread is created, and the filesystem can then register its
request in the request list.
This thread then walks through the list of requests, picking up
scheduled requests and invoking ext4_init_inode_table(). The next schedule
time for a request is computed by multiplying the time it took to
zero out the last inode table by a wait multiplier, which can be set with
the (init_itable=n) mount option (default is 10). We do
this so that we do not take the whole I/O bandwidth. When the thread is no
longer necessary (the request list is empty), it frees the appropriate
structures and exits (and can be created again later by another
filesystem).
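The scheduling rule described above amounts to a small piece of arithmetic, sketched here in userspace C. The function name `next_sched_ms()` and the use of millisecond timestamps are assumptions made for illustration; the kernel thread itself works in jiffies.

```c
#include <assert.h>
#include <stdint.h>

#define DEFAULT_WAIT_MULT 10 /* default multiplier named in the text */

/*
 * Given when zeroing of one inode table started and finished (in ms),
 * return the earliest time at which the next table should be zeroed.
 */
static uint64_t next_sched_ms(uint64_t start_ms, uint64_t end_ms,
			      unsigned int wait_mult)
{
	assert(end_ms >= start_ms);

	/* Wait wait_mult times as long as the last zeroing pass took. */
	return end_ms + (end_ms - start_ms) * wait_mult;
}
```

If zeroing one table took 50 ms and finished at t = 1050 ms, the next pass is due at t = 1550 ms with the default multiplier, so the thread spends roughly 1/11 of its time doing I/O.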
We do not disturb regular inode allocations in any way; the allocator does
not care whether the inode table is zeroed or not. But when zeroing, we
have to skip used inodes, obviously. We should also prevent new inode
allocations from the group while zeroing is under way. For that we
take alloc_sem for writing in ext4_init_inode_table() and for reading
in ext4_claim_inode(), so when we are unlucky and the allocator hits the
group which is currently being zeroed, it just has to wait.
This can be suppressed using the mount option no_init_itable.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28 05:30:05 +04:00
static int ext4_check_descriptors(struct super_block *sb,
2016-08-01 07:51:02 +03:00
				  ext4_fsblk_t sb_block,
2010-10-28 05:30:05 +04:00
				  ext4_group_t *first_not_zeroed)
2006-10-11 12:20:50 +04:00
{
2006-10-11 12:20:53 +04:00
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	ext4_fsblk_t first_block = le32_to_cpu(sbi->s_es->s_first_data_block);
	ext4_fsblk_t last_block;
2006-10-11 12:21:10 +04:00
	ext4_fsblk_t block_bitmap;
	ext4_fsblk_t inode_bitmap;
	ext4_fsblk_t inode_table;
2007-10-17 02:38:25 +04:00
	int flexbg_flag = 0;
2010-10-28 05:30:05 +04:00
	ext4_group_t i, grp = sbi->s_groups_count;
2006-10-11 12:20:50 +04:00
2015-10-17 23:18:43 +03:00
	if (ext4_has_feature_flex_bg(sb))
2007-10-17 02:38:25 +04:00
		flexbg_flag = 1;
2008-09-09 06:25:24 +04:00
	ext4_debug("Checking group descriptors");
2006-10-11 12:20:50 +04:00
2008-02-06 12:40:16 +03:00
	for (i = 0; i < sbi->s_groups_count; i++) {
		struct ext4_group_desc *gdp = ext4_get_group_desc(sb, i, NULL);
2007-10-17 02:38:25 +04:00
		if (i == sbi->s_groups_count - 1 || flexbg_flag)
2006-10-11 12:21:10 +04:00
			last_block = ext4_blocks_count(sbi->s_es) - 1;
2006-10-11 12:20:50 +04:00
		else
			last_block = first_block +
2006-10-11 12:20:53 +04:00
				(EXT4_BLOCKS_PER_GROUP(sb) - 1);
2006-10-11 12:20:50 +04:00
2010-10-28 05:30:05 +04:00
		if ((grp == sbi->s_groups_count) &&
		    !(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED)))
			grp = i;
2006-10-11 12:21:15 +04:00
		block_bitmap = ext4_block_bitmap(sb, gdp);
2016-08-01 07:51:02 +03:00
		if (block_bitmap == sb_block) {
			ext4_msg(sb, KERN_ERR, "ext4_check_descriptors: "
				 "Block bitmap for group %u overlaps "
				 "superblock", i);
		}
2008-07-27 00:15:44 +04:00
		if (block_bitmap < first_block || block_bitmap > last_block) {
2009-06-05 01:36:36 +04:00
			ext4_msg(sb, KERN_ERR, "ext4_check_descriptors: "
2009-01-06 06:18:16 +03:00
				 "Block bitmap for group %u not in group "
2009-06-05 01:36:36 +04:00
				 "(block %llu)!", i, block_bitmap);
2006-10-11 12:20:50 +04:00
			return 0;
		}
2006-10-11 12:21:15 +04:00
		inode_bitmap = ext4_inode_bitmap(sb, gdp);
2016-08-01 07:51:02 +03:00
		if (inode_bitmap == sb_block) {
			ext4_msg(sb, KERN_ERR, "ext4_check_descriptors: "
				 "Inode bitmap for group %u overlaps "
				 "superblock", i);
		}
2008-07-27 00:15:44 +04:00
		if (inode_bitmap < first_block || inode_bitmap > last_block) {
2009-06-05 01:36:36 +04:00
			ext4_msg(sb, KERN_ERR, "ext4_check_descriptors: "
2009-01-06 06:18:16 +03:00
				 "Inode bitmap for group %u not in group "
2009-06-05 01:36:36 +04:00
				 "(block %llu)!", i, inode_bitmap);
2006-10-11 12:20:50 +04:00
			return 0;
		}
2006-10-11 12:21:15 +04:00
		inode_table = ext4_inode_table(sb, gdp);
2016-08-01 07:51:02 +03:00
		if (inode_table == sb_block) {
			ext4_msg(sb, KERN_ERR, "ext4_check_descriptors: "
				 "Inode table for group %u overlaps "
				 "superblock", i);
		}
2006-10-11 12:21:10 +04:00
		if (inode_table < first_block ||
2008-07-27 00:15:44 +04:00
		    inode_table + sbi->s_itb_per_group - 1 > last_block) {
2009-06-05 01:36:36 +04:00
			ext4_msg(sb, KERN_ERR, "ext4_check_descriptors: "
2009-01-06 06:18:16 +03:00
				 "Inode table for group %u not in group "
2009-06-05 01:36:36 +04:00
				 "(block %llu)!", i, inode_table);
2006-10-11 12:20:50 +04:00
			return 0;
		}
2009-05-03 04:35:09 +04:00
		ext4_lock_group(sb, i);
2012-04-30 02:45:10 +04:00
		if (!ext4_group_desc_csum_verify(sb, i, gdp)) {
2009-06-05 01:36:36 +04:00
			ext4_msg(sb, KERN_ERR, "ext4_check_descriptors: "
				 "Checksum for group %u failed (%u!=%u)",
2015-10-17 23:18:43 +03:00
				 i, le16_to_cpu(ext4_group_desc_csum(sb, i,
2009-06-05 01:36:36 +04:00
				     gdp)), le16_to_cpu(gdp->bg_checksum));
2008-09-08 18:47:19 +04:00
			if (!(sb->s_flags & MS_RDONLY)) {
2009-05-03 04:35:09 +04:00
				ext4_unlock_group(sb, i);
2008-07-26 22:34:21 +04:00
				return 0;
2008-09-08 18:47:19 +04:00
			}
2007-10-17 02:38:25 +04:00
		}
2009-05-03 04:35:09 +04:00
		ext4_unlock_group(sb, i);
2007-10-17 02:38:25 +04:00
		if (!flexbg_flag)
			first_block += EXT4_BLOCKS_PER_GROUP(sb);
2006-10-11 12:20:50 +04:00
	}
ext4: add support for lazy inode table initialization
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out. The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, has not been corrupted However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the the old, uninitialized data could potentially cause e2fsck to
report false problems.
Hence, it is important for the inode tables to be initialized as soon
as possble. This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.
This is done via a new kernel thread called ext4lazyinit, which is
created on demand and destroyed when it is no longer needed. There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with the init_itable mount option is mounted, the
ext4lazyinit thread is created, and the filesystem can then register
its request in the request list.
This thread then walks through the list of requests, picking up
scheduled requests and invoking ext4_init_inode_table(). The next schedule
time for a request is computed by multiplying the time it took to
zero out the last inode table by the wait multiplier, which can be set with
the (init_itable=n) mount option (default is 10). We do
this so that we do not take the whole I/O bandwidth. When the thread is no
longer necessary (the request list is empty), it frees the appropriate
structures and exits (and can be created again later by another
filesystem).
We do not disturb regular inode allocations in any way; the allocator does not
care whether the inode table is zeroed or not. But when zeroing, we
have to skip used inodes. We should also prevent new inode
allocations from the group while zeroing is under way. For that we
take alloc_sem for writing in ext4_init_inode_table() and for reading
in ext4_claim_inode(), so when we are unlucky and the allocator hits the
group which is currently being zeroed, it just has to wait.
This can be suppressed using the mount option no_init_itable.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-28 05:30:05 +04:00
if ( NULL ! = first_not_zeroed )
* first_not_zeroed = grp ;
2006-10-11 12:20:50 +04:00
return 1 ;
}
2006-10-11 12:20:53 +04:00
/* ext4_orphan_cleanup() walks a singly-linked list of inodes (starting at
2006-10-11 12:20:50 +04:00
* the superblock ) which were deleted from all directories , but held open by
* a process at the time of a crash . We walk the list and try to delete these
* inodes at recovery time ( only with a read - write filesystem ) .
*
* In order to keep the orphan inode chain consistent during traversal ( in
* case of crash during recovery ) , we link each inode into the superblock
* orphan list_head and handle it the same way as an inode deletion during
* normal operation ( which journals the operations for us ) .
*
* We only do an iget ( ) and an iput ( ) on each inode , which is very safe if we
* accidentally point at an in - use or already deleted inode . The worst that
* can happen in this case is that we get a " bit already cleared " message from
2006-10-11 12:20:53 +04:00
* ext4_free_inode ( ) . The only reason we would point at a wrong inode is if
2006-10-11 12:20:50 +04:00
* e2fsck was run on this filesystem , and it must have already done the orphan
* inode cleanup for us , so we can safely abort without any further action .
*/
2008-07-27 00:15:44 +04:00
static void ext4_orphan_cleanup ( struct super_block * sb ,
struct ext4_super_block * es )
2006-10-11 12:20:50 +04:00
{
unsigned int s_flags = sb - > s_flags ;
2016-11-14 06:02:26 +03:00
int ret , nr_orphans = 0 , nr_truncates = 0 ;
2006-10-11 12:20:50 +04:00
# ifdef CONFIG_QUOTA
int i ;
# endif
if ( ! es - > s_last_orphan ) {
jbd_debug ( 4 , " no orphan inodes to clean up \n " ) ;
return ;
}
2006-12-07 07:40:13 +03:00
if ( bdev_read_only ( sb - > s_bdev ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " write access "
" unavailable, skipping orphan cleanup " ) ;
2006-12-07 07:40:13 +03:00
return ;
}
2011-02-28 08:53:45 +03:00
/* Check if feature set would not allow a r/w mount */
if ( ! ext4_feature_set_ok ( sb , 0 ) ) {
ext4_msg ( sb , KERN_INFO , " Skipping orphan cleanup due to "
" unknown ROCOMPAT features " ) ;
return ;
}
2006-10-11 12:20:53 +04:00
if ( EXT4_SB ( sb ) - > s_mount_state & EXT4_ERROR_FS ) {
2012-09-27 07:30:12 +04:00
/* don't clear list on RO mount w/ errors */
if ( es - > s_last_orphan & & ! ( s_flags & MS_RDONLY ) ) {
2014-09-16 22:52:03 +04:00
ext4_msg ( sb , KERN_INFO , " Errors on filesystem, "
2006-10-11 12:20:50 +04:00
" clearing orphan list. \n " ) ;
2012-09-27 07:30:12 +04:00
es - > s_last_orphan = 0 ;
}
2006-10-11 12:20:50 +04:00
jbd_debug ( 1 , " Skipping orphan recovery on fs with errors. \n " ) ;
return ;
}
if ( s_flags & MS_RDONLY ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_INFO , " orphan cleanup on readonly fs " ) ;
2006-10-11 12:20:50 +04:00
sb - > s_flags & = ~ MS_RDONLY ;
}
# ifdef CONFIG_QUOTA
/* Needed for iput() to work correctly and not trash data */
sb - > s_flags | = MS_ACTIVE ;
/* Turn on quotas so that they are updated correctly */
2014-09-11 19:15:15 +04:00
for ( i = 0 ; i < EXT4_MAXQUOTAS ; i + + ) {
2006-10-11 12:20:53 +04:00
if ( EXT4_SB ( sb ) - > s_qf_names [ i ] ) {
int ret = ext4_quota_on_mount ( sb , i ) ;
2006-10-11 12:20:50 +04:00
if ( ret < 0 )
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR ,
" Cannot turn on journaled "
" quota: error %d " , ret ) ;
2006-10-11 12:20:50 +04:00
}
}
# endif
while ( es - > s_last_orphan ) {
struct inode * inode ;
2016-07-15 06:21:35 +03:00
/*
* We may have encountered an error during cleanup ; if
* so , skip the rest .
*/
if ( EXT4_SB ( sb ) - > s_mount_state & EXT4_ERROR_FS ) {
jbd_debug ( 1 , " Skipping orphan recovery on fs with errors. \n " ) ;
es - > s_last_orphan = 0 ;
break ;
}
2008-04-30 06:04:56 +04:00
inode = ext4_orphan_get ( sb , le32_to_cpu ( es - > s_last_orphan ) ) ;
if ( IS_ERR ( inode ) ) {
2006-10-11 12:20:50 +04:00
es - > s_last_orphan = 0 ;
break ;
}
2006-10-11 12:20:53 +04:00
list_add ( & EXT4_I ( inode ) - > i_orphan , & EXT4_SB ( sb ) - > s_orphan ) ;
2010-03-03 17:05:07 +03:00
dquot_initialize ( inode ) ;
2006-10-11 12:20:50 +04:00
if ( inode - > i_nlink ) {
2013-05-28 15:51:21 +04:00
if ( test_opt ( sb , DEBUG ) )
ext4_msg ( sb , KERN_DEBUG ,
" %s: truncating inode %lu to %lld bytes " ,
__func__ , inode - > i_ino , inode - > i_size ) ;
2008-09-09 06:25:04 +04:00
jbd_debug ( 2 , " truncating inode %lu to %lld bytes \n " ,
2006-10-11 12:20:50 +04:00
inode - > i_ino , inode - > i_size ) ;
2016-01-22 23:40:57 +03:00
inode_lock ( inode ) ;
2013-05-28 07:32:35 +04:00
truncate_inode_pages ( inode - > i_mapping , inode - > i_size ) ;
2016-11-14 06:02:26 +03:00
ret = ext4_truncate ( inode ) ;
if ( ret )
ext4_std_error ( inode - > i_sb , ret ) ;
2016-01-22 23:40:57 +03:00
inode_unlock ( inode ) ;
2006-10-11 12:20:50 +04:00
nr_truncates + + ;
} else {
2013-05-28 15:51:21 +04:00
if ( test_opt ( sb , DEBUG ) )
ext4_msg ( sb , KERN_DEBUG ,
" %s: deleting unreferenced inode %lu " ,
__func__ , inode - > i_ino ) ;
2006-10-11 12:20:50 +04:00
jbd_debug ( 2 , " deleting unreferenced inode %lu \n " ,
inode - > i_ino ) ;
nr_orphans + + ;
}
iput ( inode ) ; /* The delete magic happens here! */
}
2008-07-27 00:15:44 +04:00
# define PLURAL(x) (x), ((x) == 1) ? "" : "s"
2006-10-11 12:20:50 +04:00
if ( nr_orphans )
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_INFO , " %d orphan inode%s deleted " ,
PLURAL ( nr_orphans ) ) ;
2006-10-11 12:20:50 +04:00
if ( nr_truncates )
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_INFO , " %d truncate%s cleaned up " ,
PLURAL ( nr_truncates ) ) ;
2006-10-11 12:20:50 +04:00
# ifdef CONFIG_QUOTA
/* Turn quotas off */
2014-09-11 19:15:15 +04:00
for ( i = 0 ; i < EXT4_MAXQUOTAS ; i + + ) {
2006-10-11 12:20:50 +04:00
if ( sb_dqopt ( sb ) - > files [ i ] )
2010-05-19 15:16:45 +04:00
dquot_quota_off ( sb , i ) ;
2006-10-11 12:20:50 +04:00
}
# endif
sb - > s_flags = s_flags ; /* Restore MS_RDONLY status */
}
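The orphan walk described in the comment above ext4_orphan_cleanup() can be sketched in user space as follows. This is only an illustration under stated assumptions: `toy_inode`, `toy_counts`, `toy_orphan_walk` and the `next_orphan` field are hypothetical names, the inode table is a plain array, and the list head is advanced explicitly, whereas the kernel relies on iput() and the journal to unlink each inode.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an on-disk inode: link count plus the
 * inode number of the next orphan in the chain (0 ends the chain). */
struct toy_inode {
	uint32_t nlink;
	uint32_t next_orphan;
};

struct toy_counts {
	int nr_orphans;   /* unreferenced inodes that would be deleted */
	int nr_truncates; /* still-linked inodes that would be truncated */
};

/* Walk the chain starting at *last_orphan (1-based inode numbers).
 * A bad inode number clears the head and stops, mirroring how the
 * kernel gives up when ext4_orphan_get() fails. The sketch assumes
 * the chain has no cycles. */
static struct toy_counts toy_orphan_walk(uint32_t *last_orphan,
					 const struct toy_inode *table,
					 size_t n_inodes)
{
	struct toy_counts c = { 0, 0 };

	while (*last_orphan) {
		uint32_t ino = *last_orphan;
		const struct toy_inode *inode;

		if (ino > n_inodes) {
			*last_orphan = 0;
			break;
		}
		inode = &table[ino - 1];
		if (inode->nlink)
			c.nr_truncates++;
		else
			c.nr_orphans++;
		*last_orphan = inode->next_orphan;
	}
	return c;
}
```

The split between the two counters follows the i_nlink test in the loop above: a non-zero link count means an interrupted truncate, zero means the inode is deleted.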
2009-06-04 01:59:28 +04:00
2008-01-29 07:58:27 +03:00
/*
* Maximal extent format file size .
* Resulting logical blkno at s_maxbytes must fit in our on - disk
* extent format containers , within a sector_t , and within i_blocks
* in the vfs . ext4 inode has 48 bits of i_block in fsblock units ,
* so that won ' t be a limiting factor .
*
2011-06-06 08:05:17 +04:00
* However , there is another limiting factor . We store extents in the form
* of starting block and length , hence the resulting length of the extent
* covering maximum file size must fit into on - disk format containers as
* well . Given that the length is always 1 unit bigger than the max unit ( because
* we count 0 as well ) we have to lower the s_maxbytes by one fs block .
*
2008-01-29 07:58:27 +03:00
* Note , this does * not * consider any metadata overhead for vfs i_blocks .
*/
2008-10-17 06:50:48 +04:00
static loff_t ext4_max_size ( int blkbits , int has_huge_files )
2008-01-29 07:58:27 +03:00
{
loff_t res ;
loff_t upper_limit = MAX_LFS_FILESIZE ;
/* small i_blocks in vfs inode? */
2008-10-17 06:50:48 +04:00
if ( ! has_huge_files | | sizeof ( blkcnt_t ) < sizeof ( u64 ) ) {
2008-01-29 07:58:27 +03:00
/*
2009-06-19 10:08:50 +04:00
* ! has_huge_files or CONFIG_LBDAF not enabled implies that the inode
* i_blocks field counts total blocks in 512 - byte sectors :
* 32 bits = = size of vfs inode i_blocks * 8
*/
upper_limit = ( 1LL < < 32 ) - 1 ;
/* total blocks in file system block size */
upper_limit > > = ( blkbits - 9 ) ;
upper_limit < < = blkbits ;
}
2011-06-06 08:05:17 +04:00
/*
* 32 - bit extent - start container , ee_block . We lower the maxbytes
* by one fs block , so ee_len can cover the extent of maximum file
* size
*/
res = ( 1LL < < 32 ) - 1 ;
2008-01-29 07:58:27 +03:00
res < < = blkbits ;
/* Sanity check against vm- & vfs- imposed limits */
if ( res > upper_limit )
res = upper_limit ;
return res ;
}
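To make the arithmetic in ext4_max_size() concrete, here is a user-space sketch of the same steps. It is a hedged illustration, not the kernel code: `sketch_max_extent_size` and its `wide_blkcnt` parameter (standing in for the `sizeof(blkcnt_t)` test) are hypothetical, and `INT64_MAX` is assumed as the value of MAX_LFS_FILESIZE.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the extent-format size limit.
 * blkbits:        log2 of the block size (10 for 1 KiB, 12 for 4 KiB)
 * has_huge_files: non-zero if the huge_file feature is set
 * wide_blkcnt:    non-zero if blkcnt_t is 64 bits wide (assumption) */
static int64_t sketch_max_extent_size(int blkbits, int has_huge_files,
				      int wide_blkcnt)
{
	int64_t upper_limit = INT64_MAX; /* assumed MAX_LFS_FILESIZE */
	int64_t res;

	assert(blkbits >= 10 && blkbits <= 16);

	if (!has_huge_files || !wide_blkcnt) {
		/* i_blocks is a 32-bit count of 512-byte sectors */
		upper_limit = (1LL << 32) - 1;
		/* convert sectors to whole filesystem blocks ... */
		upper_limit >>= (blkbits - 9);
		/* ... and blocks to bytes */
		upper_limit <<= blkbits;
	}

	/* 32-bit logical block numbers, lowered by one block */
	res = (1LL << 32) - 1;
	res <<= blkbits;

	if (res > upper_limit)
		res = upper_limit;
	return res;
}
```

With 4 KiB blocks and huge files this gives 2^44 - 4096 bytes, one block short of 16 TiB; without huge files the i_blocks limit wins and the result drops to 2^41 - 4096 bytes.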
2006-10-11 12:20:50 +04:00
/*
2008-01-29 07:58:27 +03:00
* Maximal bitmap file size . There is a direct , and { , double - , triple - } indirect
2008-01-29 07:58:26 +03:00
* block limit , and also a limit of ( 2 ^ 48 - 1 ) 512 - byte sectors in i_blocks .
* We need to be 1 filesystem block less than the 2 ^ 48 sector limit .
2006-10-11 12:20:50 +04:00
*/
2008-10-17 06:50:48 +04:00
static loff_t ext4_max_bitmap_size ( int bits , int has_huge_files )
2006-10-11 12:20:50 +04:00
{
2006-10-11 12:20:53 +04:00
loff_t res = EXT4_NDIR_BLOCKS ;
2008-01-29 07:58:26 +03:00
int meta_blocks ;
loff_t upper_limit ;
2009-06-04 01:59:28 +04:00
/* This is calculated to be the largest file size for a dense, block
* mapped file such that the file ' s total number of 512 - byte sectors ,
* including data and all indirect blocks , does not exceed ( 2 ^ 48 - 1 ) .
*
* __u32 i_blocks_lo and __u16 i_blocks_high represent the total
* number of 512 - byte sectors of the file .
2008-01-29 07:58:26 +03:00
*/
2008-10-17 06:50:48 +04:00
if ( ! has_huge_files | | sizeof ( blkcnt_t ) < sizeof ( u64 ) ) {
2008-01-29 07:58:26 +03:00
/*
2009-06-19 10:08:50 +04:00
* ! has_huge_files or CONFIG_LBDAF not enabled implies that
2009-06-04 01:59:28 +04:00
* the inode i_blocks field counts the file ' s blocks in 512 - byte
* sectors , at most 2 ^ 32 of them ; 32 bits = = size of vfs inode i_blocks * 8
2008-01-29 07:58:26 +03:00
*/
upper_limit = ( 1LL < < 32 ) - 1 ;
/* total blocks in file system block size */
upper_limit > > = ( bits - 9 ) ;
} else {
2008-01-29 07:58:27 +03:00
/*
* We use 48 bit ext4_inode i_blocks
* With EXT4_HUGE_FILE_FL set the i_blocks
* represent total number of blocks in
* file system block size
*/
2008-01-29 07:58:26 +03:00
upper_limit = ( 1LL < < 48 ) - 1 ;
}
/* indirect blocks */
meta_blocks = 1 ;
/* double indirect blocks */
meta_blocks + = 1 + ( 1LL < < ( bits - 2 ) ) ;
/* triple indirect blocks */
meta_blocks + = 1 + ( 1LL < < ( bits - 2 ) ) + ( 1LL < < ( 2 * ( bits - 2 ) ) ) ;
upper_limit - = meta_blocks ;
upper_limit < < = bits ;
2006-10-11 12:20:50 +04:00
res + = 1LL < < ( bits - 2 ) ;
res + = 1LL < < ( 2 * ( bits - 2 ) ) ;
res + = 1LL < < ( 3 * ( bits - 2 ) ) ;
res < < = bits ;
if ( res > upper_limit )
res = upper_limit ;
2008-01-29 07:58:26 +03:00
if ( res > MAX_LFS_FILESIZE )
res = MAX_LFS_FILESIZE ;
2006-10-11 12:20:50 +04:00
return res ;
}
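The block-mapped limit computed by ext4_max_bitmap_size() can be worked through with a small sketch. The function name `sketch_max_bitmap_size` is hypothetical, a 64-bit blkcnt_t is assumed, the final MAX_LFS_FILESIZE clamp is omitted, and the block size is restricted to 1 KiB through 4 KiB so the 64-bit shifts cannot overflow in this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the indirect-block size limit for a dense file.
 * bits: log2 of the block size; each block holds 2^(bits-2) pointers. */
static int64_t sketch_max_bitmap_size(int bits, int has_huge_files)
{
	int64_t res = 12; /* EXT4_NDIR_BLOCKS direct blocks */
	int64_t upper_limit;
	int64_t meta_blocks;

	assert(bits >= 10 && bits <= 12);

	if (!has_huge_files) {
		/* 32-bit count of 512-byte sectors, in fs blocks */
		upper_limit = (1LL << 32) - 1;
		upper_limit >>= (bits - 9);
	} else {
		/* 48-bit i_blocks counted in fs blocks */
		upper_limit = (1LL << 48) - 1;
	}

	/* mapping blocks for the single, double and triple indirect trees */
	meta_blocks = 1;
	meta_blocks += 1 + (1LL << (bits - 2));
	meta_blocks += 1 + (1LL << (bits - 2)) + (1LL << (2 * (bits - 2)));

	upper_limit -= meta_blocks;
	upper_limit <<= bits;

	/* data blocks reachable through each level of indirection */
	res += 1LL << (bits - 2);
	res += 1LL << (2 * (bits - 2));
	res += 1LL << (3 * (bits - 2));
	res <<= bits;

	return res > upper_limit ? upper_limit : res;
}
```

For 4 KiB blocks with huge files the addressing limit dominates: 12 + 2^10 + 2^20 + 2^30 = 1074791436 blocks, roughly 4 TiB. Without huge files the i_blocks limit is smaller and becomes the result.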
2006-10-11 12:20:53 +04:00
static ext4_fsblk_t descriptor_loc ( struct super_block * sb ,
2009-06-04 01:59:28 +04:00
ext4_fsblk_t logical_sb_block , int nr )
2006-10-11 12:20:50 +04:00
{
2006-10-11 12:20:53 +04:00
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
2008-01-29 07:58:27 +03:00
ext4_group_t bg , first_meta_bg ;
2006-10-11 12:20:50 +04:00
int has_super = 0 ;
first_meta_bg = le32_to_cpu ( sbi - > s_es - > s_first_meta_bg ) ;
2015-10-17 23:18:43 +03:00
if ( ! ext4_has_feature_meta_bg ( sb ) | | nr < first_meta_bg )
2006-10-11 12:21:20 +04:00
return logical_sb_block + nr + 1 ;
2006-10-11 12:20:50 +04:00
bg = sbi - > s_desc_per_block * nr ;
2006-10-11 12:20:53 +04:00
if ( ext4_bg_has_super ( sb , bg ) )
2006-10-11 12:20:50 +04:00
has_super = 1 ;
2009-06-04 01:59:28 +04:00
2014-05-12 18:06:27 +04:00
/*
* If we have a meta_bg fs with 1 k blocks , group 0 ' s GDT is at
* block 2 , not 1. If s_first_data_block = = 0 ( bigalloc is enabled
* on modern mke2fs or blksize > 1 k on older mke2fs ) then we must
* compensate .
*/
if ( sb - > s_blocksize = = 1024 & & nr = = 0 & &
le32_to_cpu ( EXT4_SB ( sb ) - > s_es - > s_first_data_block ) = = 0 )
has_super + + ;
2006-10-11 12:20:53 +04:00
return ( has_super + ext4_group_first_block_no ( sb , bg ) ) ;
2006-10-11 12:20:50 +04:00
}
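The two placement rules in descriptor_loc() can be illustrated with a self-contained sketch. Everything prefixed `toy_` is hypothetical, and the backup-superblock rule used here (groups 0 and 1, plus odd powers of 3, 5 and 7) is an assumption standing in for ext4_bg_has_super() on a sparse_super filesystem.

```c
#include <stdint.h>

/* Hypothetical layout parameters for the sketch. */
struct toy_layout {
	uint32_t blocksize;
	uint32_t blocks_per_group;
	uint32_t desc_per_block;
	uint32_t first_data_block;
	uint32_t first_meta_bg;
	int has_meta_bg;
};

static int toy_is_power_of(uint32_t n, uint32_t base)
{
	while (n > 1) {
		if (n % base)
			return 0;
		n /= base;
	}
	return n == 1;
}

/* Assumed sparse_super rule for which groups carry a superblock copy. */
static int toy_group_has_super(uint32_t group)
{
	if (group <= 1)
		return 1;
	if (!(group & 1))
		return 0;
	return toy_is_power_of(group, 3) || toy_is_power_of(group, 5) ||
	       toy_is_power_of(group, 7);
}

/* Block number of the nr-th group descriptor block. */
static uint64_t toy_descriptor_loc(const struct toy_layout *l,
				   uint64_t logical_sb_block, uint32_t nr)
{
	uint32_t bg;
	int has_super;

	/* Classic layout: descriptors follow the superblock directly. */
	if (!l->has_meta_bg || nr < l->first_meta_bg)
		return logical_sb_block + nr + 1;

	/* meta_bg layout: the descriptor block lives at the start of the
	 * first group it describes, after a superblock copy if present. */
	bg = l->desc_per_block * nr;
	has_super = toy_group_has_super(bg);

	/* 1 KiB blocks with s_first_data_block == 0: skip one more block. */
	if (l->blocksize == 1024 && nr == 0 && l->first_data_block == 0)
		has_super++;

	return has_super + (uint64_t)bg * l->blocks_per_group +
	       l->first_data_block;
}
```

With 1 KiB blocks the sketch returns block 2 for descriptor block 0 whether s_first_data_block is 1 or 0, which is the compensation the comment in the function describes.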
2008-01-29 08:19:52 +03:00
/**
* ext4_get_stripe_size : Get the stripe size .
* @ sbi : In memory super block info
*
* If a stripe size was specified via mount option and it does not
* exceed the blocks per group , use the mount option value . Otherwise
* fall back to the super block stripe width , and then to the super block
* stride , each only if it does not exceed the blocks per group .
* If none qualifies , return 0. The allocator needs it to be no greater
* than blocks per group .
*
*/
static unsigned long ext4_get_stripe_size ( struct ext4_sb_info * sbi )
{
unsigned long stride = le16_to_cpu ( sbi - > s_es - > s_raid_stride ) ;
unsigned long stripe_width =
le32_to_cpu ( sbi - > s_es - > s_raid_stripe_width ) ;
2011-07-18 05:18:51 +04:00
int ret ;
2008-01-29 08:19:52 +03:00
if ( sbi - > s_stripe & & sbi - > s_stripe < = sbi - > s_blocks_per_group )
2011-07-18 05:18:51 +04:00
ret = sbi - > s_stripe ;
2017-02-10 08:56:09 +03:00
else if ( stripe_width & & stripe_width < = sbi - > s_blocks_per_group )
2011-07-18 05:18:51 +04:00
ret = stripe_width ;
2017-02-10 08:56:09 +03:00
else if ( stride & & stride < = sbi - > s_blocks_per_group )
2011-07-18 05:18:51 +04:00
ret = stride ;
else
ret = 0 ;
2008-01-29 08:19:52 +03:00
2011-07-18 05:18:51 +04:00
/*
* If the stripe width is 1 , this makes no sense and
* we set it to 0 to turn off stripe handling code .
*/
if ( ret < = 1 )
ret = 0 ;
2008-01-29 08:19:52 +03:00
2011-07-18 05:18:51 +04:00
return ret ;
2008-01-29 08:19:52 +03:00
}
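The fallback order in ext4_get_stripe_size() is easy to check in isolation. The sketch below takes the three candidate values as plain arguments; `sketch_stripe_size` and its parameter names are hypothetical and only mirror the branches shown above.

```c
/* Pick the first candidate that is non-zero and no larger than
 * blocks_per_group, in the order: mount option, superblock stripe
 * width, superblock stride. A result of 0 or 1 disables striping. */
static unsigned long sketch_stripe_size(unsigned long mount_stripe,
					unsigned long sb_stripe_width,
					unsigned long sb_stride,
					unsigned long blocks_per_group)
{
	unsigned long ret;

	if (mount_stripe && mount_stripe <= blocks_per_group)
		ret = mount_stripe;
	else if (sb_stripe_width && sb_stripe_width <= blocks_per_group)
		ret = sb_stripe_width;
	else if (sb_stride && sb_stride <= blocks_per_group)
		ret = sb_stride;
	else
		ret = 0;

	/* A stripe of one block makes no sense; treat it as "off". */
	if (ret <= 1)
		ret = 0;

	return ret;
}
```

For example, a mount option larger than the group size is ignored and the superblock stripe width is used instead, provided that value fits.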
2006-10-11 12:20:50 +04:00
2009-08-18 08:20:23 +04:00
/*
* Check whether this filesystem can be mounted based on
* the features present and the RDONLY / RDWR mount requested .
* Returns 1 if this filesystem can be mounted as requested ,
* 0 if it cannot be .
*/
static int ext4_feature_set_ok ( struct super_block * sb , int readonly )
{
2015-10-17 23:18:43 +03:00
if ( ext4_has_unknown_ext4_incompat_features ( sb ) ) {
2009-08-18 08:20:23 +04:00
ext4_msg ( sb , KERN_ERR ,
" Couldn't mount because of "
" unsupported optional features (%x) " ,
( le32_to_cpu ( EXT4_SB ( sb ) - > s_es - > s_feature_incompat ) &
~ EXT4_FEATURE_INCOMPAT_SUPP ) ) ;
return 0 ;
}
if ( readonly )
return 1 ;
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_readonly ( sb ) ) {
2015-02-13 06:31:21 +03:00
ext4_msg ( sb , KERN_INFO , " filesystem is read-only " ) ;
sb - > s_flags | = MS_RDONLY ;
return 1 ;
}
2009-08-18 08:20:23 +04:00
/* Check that feature set is OK for a read-write mount */
2015-10-17 23:18:43 +03:00
if ( ext4_has_unknown_ext4_ro_compat_features ( sb ) ) {
2009-08-18 08:20:23 +04:00
ext4_msg ( sb , KERN_ERR , " couldn't mount RDWR because of "
" unsupported optional features (%x) " ,
( le32_to_cpu ( EXT4_SB ( sb ) - > s_es - > s_feature_ro_compat ) &
~ EXT4_FEATURE_RO_COMPAT_SUPP ) ) ;
return 0 ;
}
/*
* Large file size enabled file system can only be mounted
* read - write on 32 - bit systems if kernel is built with CONFIG_LBDAF
*/
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_huge_file ( sb ) ) {
2009-08-18 08:20:23 +04:00
if ( sizeof ( blkcnt_t ) < sizeof ( u64 ) ) {
ext4_msg ( sb , KERN_ERR , " Filesystem with huge files "
" cannot be mounted RDWR without "
" CONFIG_LBDAF " ) ;
return 0 ;
}
}
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_bigalloc ( sb ) & & ! ext4_has_feature_extents ( sb ) ) {
2011-09-10 02:36:51 +04:00
ext4_msg ( sb , KERN_ERR ,
" Can't support bigalloc feature without "
" extents feature \n " ) ;
return 0 ;
}
ext4: make quota as first class supported feature
This patch adds support for quotas as a first class feature in ext4;
which is to say, the quota files are stored in hidden inodes as file
system metadata, instead of as separate files visible in the file system
directory hierarchy.
It is based on the proposal at:
https://ext4.wiki.kernel.org/index.php/Design_For_1st_Class_Quota_in_Ext4
This patch introduces a new feature - EXT4_FEATURE_RO_COMPAT_QUOTA
which, when turned on, enables quota accounting at mount time
itself. Also, the quota inodes are stored in two additional superblock
fields. Some changes introduced by this patch that should be pointed
out are:
1) Two new ext4-superblock fields - s_usr_quota_inum and
s_grp_quota_inum for storing the quota inodes in use.
2) Default quota inodes are: inode#3 for tracking userquota and inode#4
for tracking group quota. The superblock fields can be set to use
other inodes as well.
3) If the QUOTA feature and corresponding quota inodes are set in
superblock, the quota usage tracking is turned on at mount time. On
'quotaon' ioctl, the quota limits enforcement is turned
on. 'quotaoff' ioctl turns off only the limits enforcement in this
case.
4) When QUOTA feature is in use, the quota mount options 'quota',
'usrquota', 'grpquota' are ignored by the kernel.
5) mke2fs or tune2fs can be used to set the QUOTA feature and initialize
quota inodes. The default reserved inodes will not be visible to user
as regular files.
6) The quota-tools will need to be modified to support hidden quota
files on ext4. E2fsprogs will also include support for creating and
fixing quota files.
7) Support is only for the new V2 quota file format.
Tested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johann Lombardi <johann@whamcloud.com>
Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-07-23 04:21:31 +04:00
# ifndef CONFIG_QUOTA
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_quota ( sb ) & & ! readonly ) {
2012-07-23 04:21:31 +04:00
ext4_msg ( sb , KERN_ERR ,
" Filesystem with quota feature cannot be mounted RDWR "
" without CONFIG_QUOTA " ) ;
return 0 ;
}
2016-01-09 00:01:22 +03:00
if ( ext4_has_feature_project ( sb ) & & ! readonly ) {
ext4_msg ( sb , KERN_ERR ,
" Filesystem with project quota feature cannot be mounted RDWR "
" without CONFIG_QUOTA " ) ;
return 0 ;
}
2012-07-23 04:21:31 +04:00
# endif /* CONFIG_QUOTA */
2009-08-18 08:20:23 +04:00
return 1 ;
}
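The core of ext4_feature_set_ok() is a pair of mask tests: any on-disk feature bit outside the supported set counts as unknown. A minimal sketch, with hypothetical names and with the supported masks passed in as arguments instead of the kernel's EXT4_FEATURE_*_SUPP constants:

```c
#include <stdint.h>

/* Bits set on disk that the running code does not know about. */
static uint32_t toy_unknown_features(uint32_t on_disk, uint32_t supported)
{
	return on_disk & ~supported;
}

/* Returns 1 if the mount may proceed, 0 otherwise.
 * Unknown incompat bits always refuse the mount; unknown ro_compat
 * bits only refuse a read-write mount. */
static int toy_feature_set_ok(uint32_t incompat, uint32_t ro_compat,
			      uint32_t incompat_supp,
			      uint32_t ro_compat_supp, int readonly)
{
	if (toy_unknown_features(incompat, incompat_supp))
		return 0;
	if (readonly)
		return 1;
	if (toy_unknown_features(ro_compat, ro_compat_supp))
		return 0;
	return 1;
}
```

The sketch leaves out the later checks in the function (the read-only feature flag, huge_file, bigalloc and quota); it only shows why the incompat test comes before the `readonly` shortcut.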
2010-07-27 19:56:04 +04:00
/*
* This function is called once a day if we have errors logged
* on the file system
*/
static void print_daily_error_info ( unsigned long arg )
{
struct super_block * sb = ( struct super_block * ) arg ;
struct ext4_sb_info * sbi ;
struct ext4_super_block * es ;
sbi = EXT4_SB ( sb ) ;
es = sbi - > s_es ;
if ( es - > s_error_count )
2014-07-06 02:40:52 +04:00
/* fsck newer than v1.41.13 is needed to clean this condition. */
ext4_msg ( sb , KERN_NOTICE , " error count since last fsck: %u " ,
2010-07-27 19:56:04 +04:00
le32_to_cpu ( es - > s_error_count ) ) ;
if ( es - > s_first_error_time ) {
2014-07-06 02:40:52 +04:00
printk ( KERN_NOTICE " EXT4-fs (%s): initial error at time %u: %.*s:%d " ,
2010-07-27 19:56:04 +04:00
sb - > s_id , le32_to_cpu ( es - > s_first_error_time ) ,
( int ) sizeof ( es - > s_first_error_func ) ,
es - > s_first_error_func ,
le32_to_cpu ( es - > s_first_error_line ) ) ;
if ( es - > s_first_error_ino )
2016-10-13 06:12:53 +03:00
printk ( KERN_CONT " : inode %u " ,
2010-07-27 19:56:04 +04:00
le32_to_cpu ( es - > s_first_error_ino ) ) ;
if ( es - > s_first_error_block )
2016-10-13 06:12:53 +03:00
printk ( KERN_CONT " : block %llu " , ( unsigned long long )
2010-07-27 19:56:04 +04:00
le64_to_cpu ( es - > s_first_error_block ) ) ;
2016-10-13 06:12:53 +03:00
printk ( KERN_CONT " \n " ) ;
2010-07-27 19:56:04 +04:00
}
if ( es - > s_last_error_time ) {
2014-07-06 02:40:52 +04:00
printk ( KERN_NOTICE " EXT4-fs (%s): last error at time %u: %.*s:%d " ,
2010-07-27 19:56:04 +04:00
sb - > s_id , le32_to_cpu ( es - > s_last_error_time ) ,
( int ) sizeof ( es - > s_last_error_func ) ,
es - > s_last_error_func ,
le32_to_cpu ( es - > s_last_error_line ) ) ;
if ( es - > s_last_error_ino )
2016-10-13 06:12:53 +03:00
printk ( KERN_CONT " : inode %u " ,
2010-07-27 19:56:04 +04:00
le32_to_cpu ( es - > s_last_error_ino ) ) ;
if ( es - > s_last_error_block )
2016-10-13 06:12:53 +03:00
printk ( KERN_CONT " : block %llu " , ( unsigned long long )
2010-07-27 19:56:04 +04:00
le64_to_cpu ( es - > s_last_error_block ) ) ;
2016-10-13 06:12:53 +03:00
printk ( KERN_CONT " \n " ) ;
2010-07-27 19:56:04 +04:00
}
mod_timer ( & sbi - > s_err_report , jiffies + 24 * 60 * 60 * HZ ) ; /* Once a day */
}
2010-10-28 05:30:05 +04:00
/* Find next suitable group and run ext4_init_inode_table */
static int ext4_run_li_request ( struct ext4_li_request * elr )
{
struct ext4_group_desc * gdp = NULL ;
ext4_group_t group , ngroups ;
struct super_block * sb ;
unsigned long timeout = 0 ;
int ret = 0 ;
sb = elr - > lr_super ;
ngroups = EXT4_SB ( sb ) - > s_groups_count ;
for ( group = elr - > lr_next_group ; group < ngroups ; group + + ) {
gdp = ext4_get_group_desc ( sb , group , NULL ) ;
if ( ! gdp ) {
ret = 1 ;
break ;
}
if ( ! ( gdp - > bg_flags & cpu_to_le16 ( EXT4_BG_INODE_ZEROED ) ) )
break ;
}
2013-01-13 17:41:45 +04:00
if ( group > = ngroups )
2010-10-28 05:30:05 +04:00
ret = 1 ;
if ( ! ret ) {
timeout = jiffies ;
ret = ext4_init_inode_table ( sb , group ,
elr - > lr_timeout ? 0 : 1 ) ;
if ( elr - > lr_timeout = = 0 ) {
2011-05-20 21:55:16 +04:00
timeout = ( jiffies - timeout ) *
elr - > lr_sbi - > s_li_wait_mult ;
2010-10-28 05:30:05 +04:00
			elr->lr_timeout = timeout;
		}
		elr->lr_next_sched = jiffies + elr->lr_timeout;
		elr->lr_next_group = group + 1;
	}
	return ret;
}
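The rescheduling step at the end of ext4_run_li_request() can be tried out in user space. The block below is a minimal sketch under stated assumptions, not the kernel code: struct li_request, li_reschedule() and tick_before() are hypothetical names, and plain unsigned counters stand in for jiffies. It assumes, as in the function above, that the delay is measured only while the stored timeout is still 0 (elapsed time multiplied by the wait multiplier) and is reused for every later group.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ext4_li_request; only the fields used here. */
struct li_request {
	unsigned int timeout;    /* ticks to wait between groups; 0 = not measured yet */
	unsigned int next_sched; /* tick at which the next group may be zeroed */
	unsigned int next_group; /* next block group to initialize */
};

/*
 * Wrap-safe "a is before b". This uses the same signed-difference idea as
 * the kernel's time_before(), applied to an illustrative unsigned counter.
 */
int tick_before(unsigned int a, unsigned int b)
{
	return (int)(a - b) < 0;
}

/*
 * Record that zeroing `group` started at tick `start` and finished at tick
 * `now`. The first call fixes the delay as elapsed * wait_mult; later calls
 * keep that delay and only move the schedule forward.
 */
void li_reschedule(struct li_request *r, unsigned int group,
		   unsigned int start, unsigned int now,
		   unsigned int wait_mult)
{
	assert(r != NULL);

	if (r->timeout == 0)
		r->timeout = (now - start) * wait_mult;
	r->next_sched = now + r->timeout;
	r->next_group = group + 1;
}
```

For instance, calling li_reschedule(&r, 0, 100, 103, 10) on a zero-initialized request stores a timeout of 30 ticks and schedules group 1 for tick 133, so the thread spends roughly one tenth of its time doing I/O.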
/*
* Remove lr_request from the list_request and free the
2011-05-20 21:49:04 +04:00
* request structure. Should be called with li_list_mtx held
2010-10-28 05:30:05 +04:00
*/
static void ext4_remove_li_request(struct ext4_li_request *elr)
{
	struct ext4_sb_info *sbi;

	if (!elr)
		return;
	sbi = elr->lr_sbi;
	list_del(&elr->lr_request);
	sbi->s_li_request = NULL;
	kfree(elr);
}

static void ext4_unregister_li_request(struct super_block *sb)
{
2011-05-20 21:55:29 +04:00
	mutex_lock(&ext4_li_mtx);
	if (!ext4_li_info) {
		mutex_unlock(&ext4_li_mtx);
2010-10-28 05:30:05 +04:00
		return;
2011-05-20 21:55:29 +04:00
	}
2010-10-28 05:30:05 +04:00
	mutex_lock(&ext4_li_info->li_list_mtx);
2011-05-20 21:55:29 +04:00
	ext4_remove_li_request(EXT4_SB(sb)->s_li_request);
2010-10-28 05:30:05 +04:00
	mutex_unlock(&ext4_li_info->li_list_mtx);
2011-05-20 21:55:29 +04:00
	mutex_unlock(&ext4_li_mtx);
2010-10-28 05:30:05 +04:00
}
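The lock nesting used by ext4_unregister_li_request() (an outer mutex that guards the existence of the shared lazy-init state, and an inner mutex that guards the request list) can be sketched in user space with POSIX threads. This is an illustration under assumptions, not the ext4 implementation: every name below (struct request, struct registry, struct fs, register_request(), unregister_request()) is hypothetical, and a hand-rolled circular list stands in for the kernel's list_head.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

struct fs;

/* Hypothetical stand-ins; none of these names come from ext4. */
struct request {
	struct request *prev, *next; /* circular doubly linked list links */
	struct fs *owner;            /* filesystem that owns this request */
};

struct registry {
	pthread_mutex_t list_lock;   /* inner lock: protects the list */
	struct request head;         /* list sentinel */
};

struct fs {
	struct request *req;         /* back-pointer, NULL when unregistered */
};

pthread_mutex_t registry_lock = PTHREAD_MUTEX_INITIALIZER; /* outer lock */
struct registry *registry;       /* NULL until the first registration */

/* Caller must hold registry->list_lock. Accepts NULL like the original. */
void remove_request(struct request *req)
{
	if (!req)
		return;
	req->prev->next = req->next;
	req->next->prev = req->prev;
	req->owner->req = NULL;
	free(req);
}

int register_request(struct fs *fs)
{
	struct request *req;
	int ret = -1;

	pthread_mutex_lock(&registry_lock);
	if (!registry) {
		/* Create the shared state on demand, under the outer lock. */
		registry = malloc(sizeof(*registry));
		if (!registry)
			goto out;
		pthread_mutex_init(&registry->list_lock, NULL);
		registry->head.prev = registry->head.next = &registry->head;
		registry->head.owner = NULL;
	}
	req = malloc(sizeof(*req));
	if (!req)
		goto out;
	req->owner = fs;

	pthread_mutex_lock(&registry->list_lock);
	req->next = registry->head.next;
	req->prev = &registry->head;
	registry->head.next->prev = req;
	registry->head.next = req;
	pthread_mutex_unlock(&registry->list_lock);

	fs->req = req;
	ret = 0;
out:
	pthread_mutex_unlock(&registry_lock);
	return ret;
}

void unregister_request(struct fs *fs)
{
	assert(fs != NULL);

	pthread_mutex_lock(&registry_lock);        /* outer lock first */
	if (!registry) {
		pthread_mutex_unlock(&registry_lock);
		return;                            /* nothing was ever registered */
	}
	pthread_mutex_lock(&registry->list_lock);  /* then the inner lock */
	remove_request(fs->req);
	pthread_mutex_unlock(&registry->list_lock);
	pthread_mutex_unlock(&registry_lock);
}
```

Taking the two locks in the same order on every path (outer, then inner) is what keeps registration and removal from deadlocking against each other in this sketch.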
2011-02-03 22:33:15 +03:00
static struct task_struct *ext4_lazyinit_task;
2010-10-28 05:30:05 +04:00
/*
 * This is the function where the ext4lazyinit thread lives. It walks
 * through the request list searching for the next scheduled filesystem.
 * When such a fs is found, run the lazy initialization request
 * (ext4_run_li_request) and keep track of the time spent in this
 * function. Based on that time we compute the next schedule time of
 * the request. When walking through the list is complete, compute the
 * next waking time and put itself to sleep.
 */
static int ext4_lazyinit_thread(void *arg)
{
	struct ext4_lazy_init *eli = (struct ext4_lazy_init *)arg;
	struct list_head *pos, *n;
	struct ext4_li_request *elr;
2011-05-20 21:49:04 +04:00
	unsigned long next_wakeup, cur;
2010-10-28 05:30:05 +04:00
	BUG_ON(NULL == eli);

cont_thread:
	while (true) {
		next_wakeup = MAX_JIFFY_OFFSET;
		mutex_lock(&eli->li_list_mtx);
		if (list_empty(&eli->li_request_list)) {
			mutex_unlock(&eli->li_list_mtx);
			goto exit_thread;
		}
		list_for_each_safe(pos, n, &eli->li_request_list) {
2016-09-06 06:38:36 +03:00
			int err = 0;
			int progress = 0;
2010-10-28 05:30:05 +04:00
			elr = list_entry(pos, struct ext4_li_request,
					 lr_request);
2016-09-06 06:38:36 +03:00
			if (time_before(jiffies, elr->lr_next_sched)) {
				if (time_before(elr->lr_next_sched, next_wakeup))
					next_wakeup = elr->lr_next_sched;
				continue;
			}
			if (down_read_trylock(&elr->lr_super->s_umount)) {
				if (sb_start_write_trylock(elr->lr_super)) {
					progress = 1;
					/*
					 * We hold sb->s_umount, sb can not
					 * be removed from the list, it is
					 * now safe to drop li_list_mtx
					 */
					mutex_unlock(&eli->li_list_mtx);
					err = ext4_run_li_request(elr);
					sb_end_write(elr->lr_super);
					mutex_lock(&eli->li_list_mtx);
					n = pos->next;
2010-11-02 21:19:30 +03:00
				}
2016-09-06 06:38:36 +03:00
				up_read((&elr->lr_super->s_umount));
			}
			/* error, remove the lazy_init job */
			if (err) {
				ext4_remove_li_request(elr);
				continue;
			}
			if (!progress) {
				elr->lr_next_sched = jiffies +
					(prandom_u32()
					 % (EXT4_DEF_LI_MAX_START_DELAY * HZ));
2010-10-28 05:30:05 +04:00
			}
			if (time_before(elr->lr_next_sched, next_wakeup))
				next_wakeup = elr->lr_next_sched;
		}
		mutex_unlock(&eli->li_list_mtx);
2011-11-22 00:32:22 +04:00
		try_to_freeze();
2010-10-28 05:30:05 +04:00
2011-05-20 21:49:04 +04:00
		cur = jiffies;
		if ((time_after_eq(cur, next_wakeup)) ||
2010-11-02 21:07:17 +03:00
		    (MAX_JIFFY_OFFSET == next_wakeup)) {
2010-10-28 05:30:05 +04:00
			cond_resched();
			continue;
		}
2011-05-20 21:49:04 +04:00
		schedule_timeout_interruptible(next_wakeup - cur);
2011-02-03 22:33:15 +03:00
		if (kthread_should_stop()) {
			ext4_clear_request_list();
			goto exit_thread;
		}
2010-10-28 05:30:05 +04:00
	}

exit_thread:
	/*
	 * It looks like the request list is empty, but we need
	 * to check it under the li_list_mtx lock, to prevent any
	 * additions into it, and of course we should lock ext4_li_mtx
	 * to atomically free the list and ext4_li_info, because at
	 * this point another ext4 filesystem could be registering
	 * a new one.
	 */
	mutex_lock(&ext4_li_mtx);
	mutex_lock(&eli->li_list_mtx);
	if (!list_empty(&eli->li_request_list)) {
		mutex_unlock(&eli->li_list_mtx);
		mutex_unlock(&ext4_li_mtx);
		goto cont_thread;
	}
	mutex_unlock(&eli->li_list_mtx);
	kfree(ext4_li_info);
	ext4_li_info = NULL;
	mutex_unlock(&ext4_li_mtx);

	return 0;
}
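
The pacing rule described in the commit message (the next run is scheduled
after the time the last zeroing pass took, multiplied by the wait multiplier)
could be sketched in plain user-space C roughly as follows. This is only an
illustration under stated assumptions: `li_request_sketch`, `li_next_sched`
and the tick-based clock are hypothetical stand-ins, not the kernel's
structures or helpers.

```c
#include <assert.h>

/* Hypothetical stand-in for a lazy-init request; not the kernel struct. */
struct li_request_sketch {
	unsigned long timeout;    /* ticks between passes; 0 = not computed yet */
	unsigned long next_sched; /* tick at which the next pass may start */
};

/*
 * Compute when the next zeroing pass may run.
 * start/end are tick counts sampled before and after one pass;
 * wait_mult plays the role of the init_itable=n value (default 10).
 */
static unsigned long li_next_sched(struct li_request_sketch *req,
				   unsigned long start, unsigned long end,
				   unsigned long wait_mult)
{
	assert(end >= start);

	/* The timeout is computed once; resetting it to 0 forces a recompute. */
	if (req->timeout == 0)
		req->timeout = (end - start) * wait_mult;

	req->next_sched = end + req->timeout;
	return req->next_sched;
}
```

With a multiplier of 10, a pass that took 3 ticks leaves the thread idle for
30 ticks before the next group is touched, which is how the zeroing avoids
saturating the device.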
static void ext4_clear_request_list(void)
{
	struct list_head *pos, *n;
	struct ext4_li_request *elr;

	mutex_lock(&ext4_li_info->li_list_mtx);
	list_for_each_safe(pos, n, &ext4_li_info->li_request_list) {
		elr = list_entry(pos, struct ext4_li_request,
				 lr_request);
		ext4_remove_li_request(elr);
	}
	mutex_unlock(&ext4_li_info->li_list_mtx);
}

static int ext4_run_lazyinit_thread(void)
{
2011-02-03 22:33:15 +03:00
	ext4_lazyinit_task = kthread_run(ext4_lazyinit_thread,
					 ext4_li_info, "ext4lazyinit");
	if (IS_ERR(ext4_lazyinit_task)) {
		int err = PTR_ERR(ext4_lazyinit_task);
2010-10-28 05:30:05 +04:00
		ext4_clear_request_list();
		kfree(ext4_li_info);
		ext4_li_info = NULL;
2012-03-20 07:41:49 +04:00
		printk(KERN_CRIT "EXT4-fs: error %d creating inode table "
2010-10-28 05:30:05 +04:00
		       "initialization thread\n",
		       err);
		return err;
	}
	ext4_li_info->li_state |= EXT4_LAZYINIT_RUNNING;
	return 0;
}
/*
 * Check whether it makes sense to run the itable init thread or not.
 * If there is at least one uninitialized inode table, return the
 * corresponding group number; otherwise the loop goes through all
 * groups and returns the total number of groups.
 */
static ext4_group_t ext4_has_uninit_itable(struct super_block *sb)
{
	ext4_group_t group, ngroups = EXT4_SB(sb)->s_groups_count;
	struct ext4_group_desc *gdp = NULL;

	for (group = 0; group < ngroups; group++) {
		gdp = ext4_get_group_desc(sb, group, NULL);
		if (!gdp)
			continue;

		if (!(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED)))
			break;
	}

	return group;
}
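
The return convention of the scan above (index of the first group whose
inode table is not yet zeroed, or the total group count when none is left)
can be illustrated with a small user-space sketch. The flag array, the
function name `first_unzeroed_group` and the flag value
`BG_INODE_ZEROED_SKETCH` are assumptions made for this example only; the
sketch also ignores the on-disk little-endian encoding and the
missing-descriptor case that the kernel function handles.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed flag value for illustration; stands in for EXT4_BG_INODE_ZEROED. */
#define BG_INODE_ZEROED_SKETCH 0x0004

/*
 * Return the index of the first group whose flags lack the "zeroed" bit.
 * If every group is already zeroed, the loop runs to completion and the
 * function returns ngroups, which callers treat as "nothing to do".
 */
static size_t first_unzeroed_group(const uint16_t *flags, size_t ngroups)
{
	size_t group;

	for (group = 0; group < ngroups; group++) {
		if (!(flags[group] & BG_INODE_ZEROED_SKETCH))
			break;
	}
	return group;
}
```

Returning the group count as the "none found" value lets the caller compare
the result against ngroups instead of needing a separate error code, which
is the comparison ext4_register_li_request() makes against first_not_zeroed.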
static int ext4_li_info_new(void)
{
	struct ext4_lazy_init *eli = NULL;

	eli = kzalloc(sizeof(*eli), GFP_KERNEL);
	if (!eli)
		return -ENOMEM;

	INIT_LIST_HEAD(&eli->li_request_list);
	mutex_init(&eli->li_list_mtx);

	eli->li_state |= EXT4_LAZYINIT_QUIT;

	ext4_li_info = eli;

	return 0;
}

static struct ext4_li_request *ext4_li_request_new(struct super_block *sb,
					    ext4_group_t start)
{
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	struct ext4_li_request *elr;

	elr = kzalloc(sizeof(*elr), GFP_KERNEL);
	if (!elr)
		return NULL;

	elr->lr_super = sb;
	elr->lr_sbi = sbi;
	elr->lr_next_group = start;

	/*
	 * Randomize first schedule time of the request to
	 * spread the inode table initialization requests
	 * better.
	 */
2013-11-08 09:14:53 +04:00
	elr->lr_next_sched = jiffies + (prandom_u32() %
				(EXT4_DEF_LI_MAX_START_DELAY * HZ));
2010-10-28 05:30:05 +04:00
	return elr;
}
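
The randomized first schedule time computed in ext4_li_request_new() can be
shown with a deterministic user-space sketch. The random value is passed in
as a parameter so the arithmetic is easy to check; `MAX_START_DELAY_SECS`
and `TICKS_PER_SEC` are assumed illustrative values standing in for
EXT4_DEF_LI_MAX_START_DELAY and HZ, and `first_sched` is a hypothetical
name.

```c
#include <stdint.h>

#define MAX_START_DELAY_SECS 5   /* assumed value, for illustration only */
#define TICKS_PER_SEC        100 /* assumed tick rate, stands in for HZ */

/*
 * Pick the first schedule time for a new request: the current tick plus a
 * pseudo-random offset in [0, MAX_START_DELAY_SECS * TICKS_PER_SEC).
 * Spreading the start times keeps several newly mounted filesystems from
 * all starting their first zeroing pass at the same moment.
 */
static unsigned long first_sched(unsigned long now, uint32_t rnd)
{
	return now + (rnd % (MAX_START_DELAY_SECS * TICKS_PER_SEC));
}
```

The modulo bounds the offset, so with these assumed constants no request
waits 500 ticks or more before its first pass, whatever random value is
drawn.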
2013-01-13 17:41:45 +04:00
int ext4_register_li_request(struct super_block *sb,
			     ext4_group_t first_not_zeroed)
2010-10-28 05:30:05 +04:00
{
	struct ext4_sb_info *sbi = EXT4_SB(sb);
2013-01-13 17:41:45 +04:00
	struct ext4_li_request *elr = NULL;
2010-10-28 05:30:05 +04:00
	ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
2011-01-10 20:30:17 +03:00
	int ret = 0;
2010-10-28 05:30:05 +04:00
2013-01-13 17:41:45 +04:00
	mutex_lock(&ext4_li_mtx);
2011-05-20 21:55:16 +04:00
	if (sbi->s_li_request != NULL) {
		/*
		 * Reset timeout so it can be computed again, because
		 * s_li_wait_mult might have changed.
		 */
		sbi->s_li_request->lr_timeout = 0;
2013-01-13 17:41:45 +04:00
		goto out;
2011-05-20 21:55:16 +04:00
	}
2010-10-28 05:30:05 +04:00
	if (first_not_zeroed == ngroups ||
	    (sb->s_flags & MS_RDONLY) ||
2011-05-09 18:28:41 +04:00
	    !test_opt(sb, INIT_INODE_TABLE))
2013-01-13 17:41:45 +04:00
		goto out;
2010-10-28 05:30:05 +04:00
	elr = ext4_li_request_new(sb, first_not_zeroed);
2013-01-13 17:41:45 +04:00
	if (!elr) {
		ret = -ENOMEM;
		goto out;
	}
2010-10-28 05:30:05 +04:00
	if (NULL == ext4_li_info) {
		ret = ext4_li_info_new();
		if (ret)
			goto out;
	}

	mutex_lock(&ext4_li_info->li_list_mtx);
	list_add(&elr->lr_request, &ext4_li_info->li_request_list);
	mutex_unlock(&ext4_li_info->li_list_mtx);

	sbi->s_li_request = elr;
2011-04-05 00:00:49 +04:00
	/*
	 * Set elr to NULL here since it has been inserted into
	 * the request_list; its removal and freeing are
	 * handled by ext4_clear_request_list from now on.
	 */
	elr = NULL;
2010-10-28 05:30:05 +04:00
	if (!(ext4_li_info->li_state & EXT4_LAZYINIT_RUNNING)) {
		ret = ext4_run_lazyinit_thread();
		if (ret)
			goto out;
	}
out:
2010-10-28 06:08:42 +04:00
	mutex_unlock(&ext4_li_mtx);
	if (ret)
2010-10-28 05:30:05 +04:00
kfree ( elr ) ;
return ret ;
}
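The throttling rule from the lazy inode table initialization commit message (the delay before the next run is the time the last zeroing took, multiplied by the wait multiplier) can be illustrated with a small user-space sketch. This is not the kernel's implementation: the function name and the plain integer tick type are hypothetical, and adding the delay to the finish time is an assumption made for the illustration.

```c
#include <stdint.h>

/*
 * Hypothetical model of the lazyinit throttling rule.
 * start/end: tick counts taken before and after zeroing one inode table.
 * wait_mult: the wait multiplier (init_itable=n, default 10 per the
 *            commit message).
 * Returns the earliest tick at which the request should run again,
 * assuming the delay is added to the finish time.
 */
uint64_t demo_li_next_schedule(uint64_t start, uint64_t end,
                               unsigned int wait_mult)
{
    uint64_t elapsed = end - start;          /* time spent zeroing */

    return end + (uint64_t)wait_mult * elapsed;
}
```

With a multiplier of 10, zeroing that took 5 ticks is followed by a 50-tick pause, so the thread spends roughly one eleventh of its time issuing I/O.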
/*
* We do not need to lock anything since this is called on
* module unload .
*/
static void ext4_destroy_lazyinit_thread ( void )
{
/*
* If thread exited earlier
* there ' s nothing to be done .
*/
2011-02-03 22:33:15 +03:00
if ( ! ext4_li_info | | ! ext4_lazyinit_task )
2010-10-28 05:30:05 +04:00
return ;
2011-02-03 22:33:15 +03:00
kthread_stop ( ext4_lazyinit_task ) ;
2010-10-28 05:30:05 +04:00
}
2012-05-27 15:48:56 +04:00
static int set_journal_csum_feature_set ( struct super_block * sb )
{
int ret = 1 ;
int compat , incompat ;
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
2014-10-13 11:36:16 +04:00
if ( ext4_has_metadata_csum ( sb ) ) {
2014-08-28 02:40:07 +04:00
/* journal checksum v3 */
2012-05-27 15:48:56 +04:00
compat = 0 ;
2014-08-28 02:40:07 +04:00
incompat = JBD2_FEATURE_INCOMPAT_CSUM_V3 ;
2012-05-27 15:48:56 +04:00
} else {
/* journal checksum v1 */
compat = JBD2_FEATURE_COMPAT_CHECKSUM ;
incompat = 0 ;
}
2014-09-11 19:38:21 +04:00
jbd2_journal_clear_features ( sbi - > s_journal ,
JBD2_FEATURE_COMPAT_CHECKSUM , 0 ,
JBD2_FEATURE_INCOMPAT_CSUM_V3 |
JBD2_FEATURE_INCOMPAT_CSUM_V2 ) ;
2012-05-27 15:48:56 +04:00
if ( test_opt ( sb , JOURNAL_ASYNC_COMMIT ) ) {
ret = jbd2_journal_set_features ( sbi - > s_journal ,
compat , 0 ,
JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT |
incompat ) ;
} else if ( test_opt ( sb , JOURNAL_CHECKSUM ) ) {
ret = jbd2_journal_set_features ( sbi - > s_journal ,
compat , 0 ,
incompat ) ;
jbd2_journal_clear_features ( sbi - > s_journal , 0 , 0 ,
JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT ) ;
} else {
2014-09-11 19:38:21 +04:00
jbd2_journal_clear_features ( sbi - > s_journal , 0 , 0 ,
JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT ) ;
2012-05-27 15:48:56 +04:00
}
return ret ;
}
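The flag selection in set_journal_csum_feature_set() can be modelled in user space as follows. This is only a sketch: the DEMO_* bit values and the demo_journal_features struct are illustrative stand-ins for the JBD2_FEATURE_* constants and the journal object, and the model ignores the fact that jbd2_journal_set_features() can fail.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values; the real JBD2_FEATURE_* constants are kernel-defined. */
#define DEMO_COMPAT_CHECKSUM        0x01u
#define DEMO_INCOMPAT_ASYNC_COMMIT  0x04u
#define DEMO_INCOMPAT_CSUM_V2       0x08u
#define DEMO_INCOMPAT_CSUM_V3       0x10u

/* Hypothetical stand-in for the journal's feature words. */
struct demo_journal_features {
    uint32_t compat;
    uint32_t incompat;
};

void demo_set_journal_csum(struct demo_journal_features *f,
                           bool metadata_csum,
                           bool async_commit_opt,
                           bool journal_checksum_opt)
{
    /* metadata_csum selects checksum v3 (incompat); otherwise v1 (compat). */
    uint32_t compat = metadata_csum ? 0 : DEMO_COMPAT_CHECKSUM;
    uint32_t incompat = metadata_csum ? DEMO_INCOMPAT_CSUM_V3 : 0;

    /* First drop every checksum flavour so stale flags cannot linger. */
    f->compat &= ~DEMO_COMPAT_CHECKSUM;
    f->incompat &= ~(DEMO_INCOMPAT_CSUM_V3 | DEMO_INCOMPAT_CSUM_V2);

    if (async_commit_opt) {
        /* Async commit implies the chosen checksum flavour as well. */
        f->compat |= compat;
        f->incompat |= incompat | DEMO_INCOMPAT_ASYNC_COMMIT;
    } else if (journal_checksum_opt) {
        f->compat |= compat;
        f->incompat |= incompat;
        f->incompat &= ~DEMO_INCOMPAT_ASYNC_COMMIT;
    } else {
        /* No checksumming requested: only make sure async commit is off. */
        f->incompat &= ~DEMO_INCOMPAT_ASYNC_COMMIT;
    }
}
```

The point the sketch tries to make visible is the ordering: all checksum bits are cleared unconditionally before the branch decides which single flavour, if any, to set again.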
2012-07-10 00:27:05 +04:00
/*
* Note : calculating the overhead so we can be compatible with
* historical BSD practice is quite difficult in the face of
* clusters / bigalloc . This is because multiple metadata blocks from
* different block groups can end up in the same allocation cluster .
* Calculating the exact overhead in the face of clustered allocation
* requires either O ( all block bitmaps ) in memory or O ( number of block
* groups * * 2 ) in time . We will still calculate the overhead for
* older file systems - - - and if we come across a bigalloc file
* system with zero in s_overhead_clusters the estimate will be close to
* correct especially for very large cluster sizes - - - but for newer
* file systems , it ' s better to calculate this figure once at mkfs
* time , and store it in the superblock . If the superblock value is
* present ( even for non - bigalloc file systems ) , we will use it .
*/
static int count_overhead ( struct super_block * sb , ext4_group_t grp ,
char * buf )
{
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
struct ext4_group_desc * gdp ;
ext4_fsblk_t first_block , last_block , b ;
ext4_group_t i , ngroups = ext4_get_groups_count ( sb ) ;
int s , j , count = 0 ;
2015-10-17 23:18:43 +03:00
if ( ! ext4_has_feature_bigalloc ( sb ) )
2012-08-16 19:59:04 +04:00
return ( ext4_bg_has_super ( sb , grp ) + ext4_bg_num_gdb ( sb , grp ) +
sbi - > s_itb_per_group + 2 ) ;
2012-07-10 00:27:05 +04:00
first_block = le32_to_cpu ( sbi - > s_es - > s_first_data_block ) +
( grp * EXT4_BLOCKS_PER_GROUP ( sb ) ) ;
last_block = first_block + EXT4_BLOCKS_PER_GROUP ( sb ) - 1 ;
for ( i = 0 ; i < ngroups ; i + + ) {
gdp = ext4_get_group_desc ( sb , i , NULL ) ;
b = ext4_block_bitmap ( sb , gdp ) ;
if ( b > = first_block & & b < = last_block ) {
ext4_set_bit ( EXT4_B2C ( sbi , b - first_block ) , buf ) ;
count + + ;
}
b = ext4_inode_bitmap ( sb , gdp ) ;
if ( b > = first_block & & b < = last_block ) {
ext4_set_bit ( EXT4_B2C ( sbi , b - first_block ) , buf ) ;
count + + ;
}
b = ext4_inode_table ( sb , gdp ) ;
if ( b > = first_block & & b + sbi - > s_itb_per_group < = last_block )
for ( j = 0 ; j < sbi - > s_itb_per_group ; j + + , b + + ) {
int c = EXT4_B2C ( sbi , b - first_block ) ;
ext4_set_bit ( c , buf ) ;
count + + ;
}
if ( i ! = grp )
continue ;
s = 0 ;
if ( ext4_bg_has_super ( sb , grp ) ) {
ext4_set_bit ( s + + , buf ) ;
count + + ;
}
2016-11-18 21:37:47 +03:00
j = ext4_bg_num_gdb ( sb , grp ) ;
if ( s + j > EXT4_BLOCKS_PER_GROUP ( sb ) ) {
ext4_error ( sb , " Invalid number of block group "
" descriptor blocks: %d " , j ) ;
j = EXT4_BLOCKS_PER_GROUP ( sb ) - s ;
2012-07-10 00:27:05 +04:00
}
2016-11-18 21:37:47 +03:00
count + = j ;
for ( ; j > 0 ; j - - )
ext4_set_bit ( EXT4_B2C ( sbi , s + + ) , buf ) ;
2012-07-10 00:27:05 +04:00
}
if ( ! count )
return 0 ;
return EXT4_CLUSTERS_PER_GROUP ( sb ) -
ext4_count_free ( buf , EXT4_CLUSTERS_PER_GROUP ( sb ) / 8 ) ;
}
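The reason count_overhead() needs a scratch bitmap on bigalloc file systems is that several metadata blocks can share one allocation cluster, so counting blocks would overstate the overhead. A minimal user-space sketch of that idea, with hypothetical names, is shown below. Unlike the kernel function, which derives the result from the number of free bits left in the buffer, this version counts a cluster the first time its bit is set; the outcome is the same.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical illustration of cluster-granular overhead counting.
 * meta_blocks:  block offsets (relative to the start of the group) that
 *               hold metadata.
 * cluster_bits: log2 of the number of blocks per cluster.
 * bitmap:       zeroed scratch buffer, one bit per cluster in the group.
 * Returns the number of distinct clusters that contain metadata.
 */
unsigned int demo_count_overhead_clusters(const unsigned int *meta_blocks,
                                          size_t n,
                                          unsigned int cluster_bits,
                                          uint8_t *bitmap)
{
    unsigned int clusters = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        unsigned int c = meta_blocks[i] >> cluster_bits; /* block -> cluster */
        uint8_t mask = (uint8_t)(1u << (c & 7));

        if (!(bitmap[c >> 3] & mask)) {   /* first block seen in this cluster */
            bitmap[c >> 3] |= mask;
            clusters++;
        }
    }
    return clusters;
}
```

For example, with 16 blocks per cluster, metadata at block offsets 0, 1, 2, 17, 18 and 40 occupies six blocks but only three clusters.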
/*
* Compute the overhead and stash it in sbi - > s_overhead
*/
int ext4_calculate_overhead ( struct super_block * sb )
{
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
struct ext4_super_block * es = sbi - > s_es ;
2016-09-30 09:08:49 +03:00
struct inode * j_inode ;
unsigned int j_blocks , j_inum = le32_to_cpu ( es - > s_journal_inum ) ;
2012-07-10 00:27:05 +04:00
ext4_group_t i , ngroups = ext4_get_groups_count ( sb ) ;
ext4_fsblk_t overhead = 0 ;
2014-11-25 21:08:04 +03:00
char * buf = ( char * ) get_zeroed_page ( GFP_NOFS ) ;
2012-07-10 00:27:05 +04:00
if ( ! buf )
return - ENOMEM ;
/*
* Compute the overhead ( FS structures ) . This is constant
* for a given filesystem unless the number of block groups
* changes so we cache the previous value until it does .
*/
/*
* All of the blocks before first_data_block are overhead
*/
overhead = EXT4_B2C ( sbi , le32_to_cpu ( es - > s_first_data_block ) ) ;
/*
* Add the overhead found in each block group
*/
for ( i = 0 ; i < ngroups ; i + + ) {
int blks ;
blks = count_overhead ( sb , i , buf ) ;
overhead + = blks ;
if ( blks )
memset ( buf , 0 , PAGE_SIZE ) ;
cond_resched ( ) ;
}
2016-09-30 09:08:49 +03:00
/*
* Add the internal journal blocks whether the journal has been
* loaded or not
*/
2014-11-26 00:27:44 +03:00
if ( sbi - > s_journal & & ! sbi - > journal_bdev )
2013-03-03 02:18:58 +04:00
overhead + = EXT4_NUM_B2C ( sbi , sbi - > s_journal - > j_maxlen ) ;
2016-09-30 09:08:49 +03:00
else if ( ext4_has_feature_journal ( sb ) & & ! sbi - > s_journal ) {
j_inode = ext4_get_journal_inode ( sb , j_inum ) ;
if ( j_inode ) {
j_blocks = j_inode - > i_size > > sb - > s_blocksize_bits ;
overhead + = EXT4_NUM_B2C ( sbi , j_blocks ) ;
iput ( j_inode ) ;
} else {
ext4_msg ( sb , KERN_ERR , " can't get journal size " ) ;
}
}
2012-07-10 00:27:05 +04:00
sbi - > s_overhead = overhead ;
smp_wmb ( ) ;
free_page ( ( unsigned long ) buf ) ;
return 0 ;
}
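The accumulation performed by ext4_calculate_overhead() can be summarised as: clusters before the first data block, plus the per-group overhead, plus the internal journal. The sketch below models that sum in user space. All names are hypothetical, and it assumes that the block-to-cluster conversion rounds down for the leading blocks (as EXT4_B2C does with a shift) and rounds up for the journal (which is what EXT4_NUM_B2C is assumed to do here).

```c
#include <stddef.h>
#include <stdint.h>

/* Number of clusters needed to hold `blocks` blocks, rounding up (assumed). */
uint64_t demo_num_b2c(uint64_t blocks, unsigned int cluster_bits)
{
    uint64_t ratio = (uint64_t)1 << cluster_bits;   /* blocks per cluster */

    return (blocks + ratio - 1) >> cluster_bits;
}

/*
 * Hypothetical model of the overhead sum, in clusters.
 * group_overhead holds the per-group values that count_overhead() would
 * return; journal_blocks is the size of an internal journal (0 if none).
 */
uint64_t demo_total_overhead(uint64_t first_data_block,
                             const uint64_t *group_overhead, size_t ngroups,
                             uint64_t journal_blocks,
                             unsigned int cluster_bits)
{
    uint64_t overhead = first_data_block >> cluster_bits;
    size_t i;

    for (i = 0; i < ngroups; i++)
        overhead += group_overhead[i];

    overhead += demo_num_b2c(journal_blocks, cluster_bits);
    return overhead;
}
```

With 16 blocks per cluster, a 1000-block journal contributes 63 clusters rather than 62, because the partially filled last cluster is still unavailable for data.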
2015-09-23 19:44:17 +03:00
static void ext4_set_resv_clusters ( struct super_block * sb )
2013-04-10 06:11:22 +04:00
{
ext4_fsblk_t resv_clusters ;
2015-09-23 19:44:17 +03:00
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
2013-04-10 06:11:22 +04:00
2013-12-09 06:11:59 +04:00
/*
* There ' s no need to reserve anything when we aren ' t using extents .
* The space estimates are exact , there are no unwritten extents ,
* hole punching doesn ' t need new metadata . . . This is needed especially
* to keep ext2 / 3 backward compatibility .
*/
2015-10-17 23:18:43 +03:00
if ( ! ext4_has_feature_extents ( sb ) )
2015-09-23 19:44:17 +03:00
return ;
2013-04-10 06:11:22 +04:00
/*
* By default we reserve 2 % or 4096 clusters , whichever is smaller .
* This should cover the situations where we can not afford to run
* out of space like for example punch hole , or converting
2014-04-21 07:45:47 +04:00
* unwritten extents in delalloc path . In most cases such
2013-04-10 06:11:22 +04:00
* allocation would require 1 , or 2 blocks , higher numbers are
* very rare .
*/
2015-09-23 19:44:17 +03:00
resv_clusters = ( ext4_blocks_count ( sbi - > s_es ) > >
sbi - > s_cluster_bits ) ;
2013-04-10 06:11:22 +04:00
do_div ( resv_clusters , 50 ) ;
resv_clusters = min_t ( ext4_fsblk_t , resv_clusters , 4096 ) ;
2015-09-23 19:44:17 +03:00
atomic64_set ( & sbi - > s_resv_clusters , resv_clusters ) ;
2013-04-10 06:11:22 +04:00
}
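The default reservation computed by ext4_set_resv_clusters() is 2% of the file system's clusters, capped at 4096. A user-space sketch of just that arithmetic follows; the function name is hypothetical, and the early return for file systems without extents is not modelled.

```c
#include <stdint.h>

/*
 * Hypothetical model of the default reserved-cluster calculation:
 * min(2% of all clusters, 4096).
 */
uint64_t demo_resv_clusters(uint64_t blocks_count, unsigned int cluster_bits)
{
    uint64_t resv = (blocks_count >> cluster_bits) / 50;   /* 2% */

    return resv < 4096 ? resv : 4096;
}
```

The cap takes effect once the file system has more than 204800 clusters (4096 * 50), so on anything but small file systems the reservation is a fixed 4096 clusters.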
2008-07-27 00:15:44 +04:00
static int ext4_fill_super ( struct super_block * sb , void * data , int silent )
2006-10-11 12:20:50 +04:00
{
2010-05-16 20:00:00 +04:00
char * orig_data = kstrdup ( data , GFP_KERNEL ) ;
2008-07-27 00:15:44 +04:00
struct buffer_head * bh ;
2006-10-11 12:20:53 +04:00
struct ext4_super_block * es = NULL ;
2016-11-18 21:24:26 +03:00
struct ext4_sb_info * sbi = kzalloc ( sizeof ( * sbi ) , GFP_KERNEL ) ;
2006-10-11 12:20:53 +04:00
ext4_fsblk_t block ;
ext4_fsblk_t sb_block = get_sb_block ( & data ) ;
2006-10-11 12:21:20 +04:00
ext4_fsblk_t logical_sb_block ;
2006-10-11 12:20:50 +04:00
unsigned long offset = 0 ;
unsigned long journal_devnum = 0 ;
unsigned long def_mount_opts ;
struct inode * root ;
2009-01-07 08:06:22 +03:00
const char * descr ;
2010-07-27 19:56:07 +04:00
int ret = - ENOMEM ;
2011-09-10 02:34:51 +04:00
int blocksize , clustersize ;
2009-01-06 22:53:26 +03:00
unsigned int db_count ;
unsigned int i ;
2011-09-10 02:34:51 +04:00
int needs_recovery , has_huge_files , has_bigalloc ;
2006-10-11 12:21:10 +04:00
__u64 blocks_count ;
2012-11-09 00:16:54 +04:00
int err = 0 ;
2009-01-06 06:46:26 +03:00
unsigned int journal_ioprio = DEFAULT_JOURNAL_IOPRIO ;
2010-10-28 05:30:05 +04:00
ext4_group_t first_not_zeroed ;
2006-10-11 12:20:50 +04:00
2016-11-18 21:24:26 +03:00
if ( ( data & & ! orig_data ) | | ! sbi )
goto out_free_base ;
2009-02-16 02:07:52 +03:00
sbi - > s_blockgroup_lock =
kzalloc ( sizeof ( struct blockgroup_lock ) , GFP_KERNEL ) ;
2016-11-18 21:24:26 +03:00
if ( ! sbi - > s_blockgroup_lock )
goto out_free_base ;
2006-10-11 12:20:50 +04:00
sb - > s_fs_info = sbi ;
2012-05-31 06:56:46 +04:00
sbi - > s_sb = sb ;
2008-10-10 07:53:47 +04:00
sbi - > s_inode_readahead_blks = EXT4_DEF_INODE_READAHEAD_BLKS ;
2007-10-17 10:26:27 +04:00
sbi - > s_sb_block = sb_block ;
2010-07-27 19:56:08 +04:00
if ( sb - > s_bdev - > bd_part )
sbi - > s_sectors_written_start =
part_stat_read ( sb - > s_bdev - > bd_part , sectors [ 1 ] ) ;
2006-10-11 12:20:50 +04:00
2008-09-23 17:18:24 +04:00
/* Cleanup superblock name */
2015-06-26 01:02:41 +03:00
strreplace ( sb - > s_id , ' / ' , ' ! ' ) ;
2008-09-23 17:18:24 +04:00
2012-11-09 00:16:54 +04:00
/* -EINVAL is default */
2010-07-27 19:56:07 +04:00
ret = - EINVAL ;
2006-10-11 12:20:53 +04:00
blocksize = sb_min_blocksize ( sb , EXT4_MIN_BLOCK_SIZE ) ;
2006-10-11 12:20:50 +04:00
if ( ! blocksize ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " unable to set blocksize " ) ;
2006-10-11 12:20:50 +04:00
goto out_fail ;
}
/*
2006-10-11 12:20:53 +04:00
* The ext4 superblock will not be buffer aligned for other than 1 kB
2006-10-11 12:20:50 +04:00
* block sizes . We need to calculate the offset from buffer start .
*/
2006-10-11 12:20:53 +04:00
if ( blocksize ! = EXT4_MIN_BLOCK_SIZE ) {
2006-10-11 12:21:20 +04:00
logical_sb_block = sb_block * EXT4_MIN_BLOCK_SIZE ;
offset = do_div ( logical_sb_block , blocksize ) ;
2006-10-11 12:20:50 +04:00
} else {
2006-10-11 12:21:20 +04:00
logical_sb_block = sb_block ;
2006-10-11 12:20:50 +04:00
}
2014-09-05 06:36:15 +04:00
if ( ! ( bh = sb_bread_unmovable ( sb , logical_sb_block ) ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " unable to read superblock " ) ;
2006-10-11 12:20:50 +04:00
goto out_fail ;
}
/*
* Note : s_es must be initialized as soon as possible because
2006-10-11 12:20:53 +04:00
* some ext4 macro - instructions depend on its value
2006-10-11 12:20:50 +04:00
*/
2012-05-29 01:47:52 +04:00
es = ( struct ext4_super_block * ) ( bh - > b_data + offset ) ;
2006-10-11 12:20:50 +04:00
sbi - > s_es = es ;
sb - > s_magic = le16_to_cpu ( es - > s_magic ) ;
2006-10-11 12:20:53 +04:00
if ( sb - > s_magic ! = EXT4_SUPER_MAGIC )
goto cantfind_ext4 ;
2009-03-01 03:39:58 +03:00
sbi - > s_kbytes_written = le64_to_cpu ( es - > s_kbytes_written ) ;
2006-10-11 12:20:50 +04:00
2012-04-30 02:45:10 +04:00
/* Warn if metadata_csum and gdt_csum are both set. */
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_metadata_csum ( sb ) & &
ext4_has_feature_gdt_csum ( sb ) )
2015-01-02 23:31:14 +03:00
ext4_warning ( sb , " metadata_csum and uninit_bg are "
2012-04-30 02:45:10 +04:00
" redundant flags; please run fsck. " ) ;
2012-04-30 02:25:10 +04:00
/* Check for a known checksum algorithm */
if ( ! ext4_verify_csum_type ( sb , es ) ) {
ext4_msg ( sb , KERN_ERR , " VFS: Found ext4 filesystem with "
" unknown checksum algorithm. " ) ;
silent = 1 ;
goto cantfind_ext4 ;
}
2012-04-30 02:27:10 +04:00
/* Load the checksum driver */
2017-06-22 18:44:55 +03:00
if ( ext4_has_feature_metadata_csum ( sb ) | |
ext4_has_feature_ea_inode ( sb ) ) {
2012-04-30 02:27:10 +04:00
sbi - > s_chksum_driver = crypto_alloc_shash ( " crc32c " , 0 , 0 ) ;
if ( IS_ERR ( sbi - > s_chksum_driver ) ) {
ext4_msg ( sb , KERN_ERR , " Cannot load crc32c driver. " ) ;
ret = PTR_ERR ( sbi - > s_chksum_driver ) ;
sbi - > s_chksum_driver = NULL ;
goto failed_mount ;
}
}
2012-04-30 02:29:10 +04:00
/* Check superblock checksum */
if ( ! ext4_superblock_csum_verify ( sb , es ) ) {
ext4_msg ( sb , KERN_ERR , " VFS: Found ext4 filesystem with "
" invalid superblock checksum. Run e2fsck? " ) ;
silent = 1 ;
2015-10-17 23:16:04 +03:00
ret = - EFSBADCRC ;
2012-04-30 02:29:10 +04:00
goto cantfind_ext4 ;
}
/* Precompute checksum seed for all metadata */
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_csum_seed ( sb ) )
2015-10-17 23:16:02 +03:00
sbi - > s_csum_seed = le32_to_cpu ( es - > s_checksum_seed ) ;
2017-06-22 18:44:55 +03:00
else if ( ext4_has_metadata_csum ( sb ) | | ext4_has_feature_ea_inode ( sb ) )
2012-04-30 02:29:10 +04:00
sbi - > s_csum_seed = ext4_chksum ( sbi , ~ 0 , es - > s_uuid ,
sizeof ( es - > s_uuid ) ) ;
2006-10-11 12:20:50 +04:00
/* Set defaults before we parse the mount options */
def_mount_opts = le32_to_cpu ( es - > s_default_mount_opts ) ;
2010-12-16 04:26:48 +03:00
set_opt ( sb , INIT_INODE_TABLE ) ;
2006-10-11 12:20:53 +04:00
if ( def_mount_opts & EXT4_DEFM_DEBUG )
2010-12-16 04:26:48 +03:00
set_opt ( sb , DEBUG ) ;
2012-03-02 09:03:21 +04:00
if ( def_mount_opts & EXT4_DEFM_BSDGROUPS )
2010-12-16 04:26:48 +03:00
set_opt ( sb , GRPID ) ;
2006-10-11 12:20:53 +04:00
if ( def_mount_opts & EXT4_DEFM_UID16 )
2010-12-16 04:26:48 +03:00
set_opt ( sb , NO_UID32 ) ;
2011-02-24 01:51:51 +03:00
/* xattr user namespace & acls are now defaulted on */
set_opt ( sb , XATTR_USER ) ;
2008-10-11 04:02:48 +04:00
# ifdef CONFIG_EXT4_FS_POSIX_ACL
2011-02-24 01:51:51 +03:00
set_opt ( sb , POSIX_ACL ) ;
2007-02-10 12:46:13 +03:00
# endif
2014-10-30 17:53:16 +03:00
/* don't forget to enable journal_csum when metadata_csum is enabled. */
if ( ext4_has_metadata_csum ( sb ) )
set_opt ( sb , JOURNAL_CHECKSUM ) ;
2006-10-11 12:20:53 +04:00
if ( ( def_mount_opts & EXT4_DEFM_JMODE ) = = EXT4_DEFM_JMODE_DATA )
2010-12-16 04:26:48 +03:00
set_opt ( sb , JOURNAL_DATA ) ;
2006-10-11 12:20:53 +04:00
else if ( ( def_mount_opts & EXT4_DEFM_JMODE ) = = EXT4_DEFM_JMODE_ORDERED )
2010-12-16 04:26:48 +03:00
set_opt ( sb , ORDERED_DATA ) ;
2006-10-11 12:20:53 +04:00
else if ( ( def_mount_opts & EXT4_DEFM_JMODE ) = = EXT4_DEFM_JMODE_WBACK )
2010-12-16 04:26:48 +03:00
set_opt ( sb , WRITEBACK_DATA ) ;
2006-10-11 12:20:53 +04:00
if ( le16_to_cpu ( sbi - > s_es - > s_errors ) = = EXT4_ERRORS_PANIC )
2010-12-16 04:26:48 +03:00
set_opt ( sb , ERRORS_PANIC ) ;
2008-01-29 07:58:26 +03:00
else if ( le16_to_cpu ( sbi - > s_es - > s_errors ) = = EXT4_ERRORS_CONTINUE )
2010-12-16 04:26:48 +03:00
set_opt ( sb , ERRORS_CONT ) ;
2008-01-29 07:58:26 +03:00
else
2010-12-16 04:26:48 +03:00
set_opt ( sb , ERRORS_RO ) ;
2014-09-02 05:34:09 +04:00
/* block_validity enabled by default; disable with noblock_validity */
set_opt ( sb , BLOCK_VALIDITY ) ;
2010-08-02 07:14:20 +04:00
if ( def_mount_opts & EXT4_DEFM_DISCARD )
2010-12-16 04:26:48 +03:00
set_opt ( sb , DISCARD ) ;
2006-10-11 12:20:50 +04:00
2012-02-08 03:41:49 +04:00
sbi - > s_resuid = make_kuid ( & init_user_ns , le16_to_cpu ( es - > s_def_resuid ) ) ;
sbi - > s_resgid = make_kgid ( & init_user_ns , le16_to_cpu ( es - > s_def_resgid ) ) ;
2009-01-04 04:27:38 +03:00
sbi - > s_commit_interval = JBD2_DEFAULT_MAX_COMMIT_AGE * HZ ;
sbi - > s_min_batch_time = EXT4_DEF_MIN_BATCH_TIME ;
sbi - > s_max_batch_time = EXT4_DEF_MAX_BATCH_TIME ;
2006-10-11 12:20:50 +04:00
2010-08-02 07:14:20 +04:00
if ( ( def_mount_opts & EXT4_DEFM_NOBARRIER ) = = 0 )
2010-12-16 04:26:48 +03:00
set_opt ( sb , BARRIER ) ;
2006-10-11 12:20:50 +04:00
2008-07-12 03:27:31 +04:00
/*
* enable delayed allocation by default
* Use - o nodelalloc to turn it off
*/
2012-09-18 06:54:36 +04:00
if ( ! IS_EXT3_SB ( sb ) & & ! IS_EXT2_SB ( sb ) & &
2010-08-02 07:14:20 +04:00
( ( def_mount_opts & EXT4_DEFM_NODELALLOC ) = = 0 ) )
2010-12-16 04:26:48 +03:00
set_opt ( sb , DELALLOC ) ;
2008-07-12 03:27:31 +04:00
2011-05-20 21:55:16 +04:00
/*
* set default s_li_wait_mult for lazyinit , for the case there is
* no mount option specified .
*/
sbi - > s_li_wait_mult = EXT4_DEF_LI_WAIT_MULT ;
2016-11-18 21:24:26 +03:00
if ( sbi - > s_es - > s_mount_opts [ 0 ] ) {
char * s_mount_opts = kstrndup ( sbi - > s_es - > s_mount_opts ,
sizeof ( sbi - > s_es - > s_mount_opts ) ,
GFP_KERNEL ) ;
if ( ! s_mount_opts )
goto failed_mount ;
if ( ! parse_options ( s_mount_opts , sb , & journal_devnum ,
& journal_ioprio , 0 ) ) {
ext4_msg ( sb , KERN_WARNING ,
" failed to parse options in superblock: %s " ,
s_mount_opts ) ;
}
kfree ( s_mount_opts ) ;
2010-08-02 07:14:20 +04:00
}
2012-03-05 04:27:31 +04:00
sbi - > s_def_mount_opt = sbi - > s_mount_opt ;
2009-01-06 06:46:26 +03:00
if ( ! parse_options ( ( char * ) data , sb , & journal_devnum ,
2012-02-21 02:53:04 +04:00
& journal_ioprio , 0 ) )
2006-10-11 12:20:50 +04:00
goto failed_mount ;
2011-09-04 02:22:38 +04:00
if ( test_opt ( sb , DATA_FLAGS ) = = EXT4_MOUNT_JOURNAL_DATA ) {
printk_once ( KERN_WARNING " EXT4-fs: Warning: mounting "
" with data=journal disables delayed "
" allocation and O_DIRECT support! \n " ) ;
if ( test_opt2 ( sb , EXPLICIT_DELALLOC ) ) {
ext4_msg ( sb , KERN_ERR , " can't mount with "
" both data=journal and delalloc " ) ;
goto failed_mount ;
}
if ( test_opt ( sb , DIOREAD_NOLOCK ) ) {
ext4_msg ( sb , KERN_ERR , " can't mount with "
2013-08-09 07:02:24 +04:00
" both data=journal and dioread_nolock " ) ;
2011-09-04 02:22:38 +04:00
goto failed_mount ;
}
2015-02-17 02:59:38 +03:00
if ( test_opt ( sb , DAX ) ) {
ext4_msg ( sb , KERN_ERR , " can't mount with "
" both data=journal and dax " ) ;
goto failed_mount ;
}
ext4: do not perform data journaling when data is encrypted
Currently data journalling is incompatible with encryption: enabling both
at the same time has never been supported by design, and would result in
unpredictable behavior. However, users are not precluded from turning on
both features simultaneously. This change programmatically replaces data
journaling for encrypted regular files with ordered data journaling mode.
Background:
Journaling encrypted data has not been supported because it operates on
buffer heads of the page in the page cache. Namely, when the commit
happens, which could be up to five seconds after caching, the commit
thread uses the buffer heads attached to the page to copy the contents of
the page to the journal. With encryption, it would have been required to
keep the bounce buffer with ciphertext for up to the aforementioned five
seconds, since the page cache can only hold plaintext and could not be
used for journaling. Alternatively, it would be required to set up the
journal to initiate a callback at the commit time to perform deferred
encryption - in this case, not only would the data have to be written
twice, but it would also have to be encrypted twice. This level of
complexity was not justified for a mode that in practice is very rarely
used because of the overhead from the data journalling.
Solution:
If data=journal has been set as a mount option for a filesystem, or if
journaling is enabled on a regular file, do not perform journaling if the
file is also encrypted, instead fall back to the data=ordered mode for the
file.
Rationale:
The intent is to allow seamless and proper filesystem operation when
journaling and encryption have both been enabled, and have these two
conflicting features gracefully resolved by the filesystem.
Fixes: 4461471107b7
Signed-off-by: Sergey Karamov <skaramov@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-12-11 01:54:58 +03:00
if ( ext4_has_feature_encrypt ( sb ) ) {
ext4_msg ( sb , KERN_WARNING ,
" encrypted files will use data=ordered "
" instead of data journaling mode " ) ;
}
2011-09-04 02:22:38 +04:00
if ( test_opt ( sb , DELALLOC ) )
clear_opt ( sb , DELALLOC ) ;
2015-07-22 06:51:26 +03:00
} else {
sb - > s_iflags | = SB_I_CGROUPWB ;
2011-09-04 02:22:38 +04:00
}
2006-10-11 12:20:50 +04:00
sb - > s_flags = ( sb - > s_flags & ~ MS_POSIXACL ) |
2010-02-24 19:35:32 +03:00
( test_opt ( sb , POSIX_ACL ) ? MS_POSIXACL : 0 ) ;
2006-10-11 12:20:50 +04:00
2006-10-11 12:20:53 +04:00
if ( le32_to_cpu ( es - > s_rev_level ) = = EXT4_GOOD_OLD_REV & &
2015-10-17 23:18:43 +03:00
( ext4_has_compat_features ( sb ) | |
ext4_has_ro_compat_features ( sb ) | |
ext4_has_incompat_features ( sb ) ) )
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_WARNING ,
" feature flags set on rev 0 fs, "
" running e2fsck is recommended " ) ;
2008-02-10 09:11:44 +03:00
2014-03-24 22:09:06 +04:00
if ( es - > s_creator_os = = cpu_to_le32 ( EXT4_OS_HURD ) ) {
set_opt2 ( sb , HURD_COMPAT ) ;
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_64bit ( sb ) ) {
2014-03-24 22:09:06 +04:00
ext4_msg ( sb , KERN_ERR ,
" The Hurd can't support 64-bit file systems " ) ;
goto failed_mount ;
}
2017-06-22 18:44:55 +03:00
/*
* ea_inode feature uses l_i_version field which is not
* available in HURD_COMPAT mode .
*/
if ( ext4_has_feature_ea_inode ( sb ) ) {
ext4_msg ( sb , KERN_ERR ,
" ea_inode feature is not supported for Hurd " ) ;
goto failed_mount ;
}
2014-03-24 22:09:06 +04:00
}
2011-04-19 01:29:14 +04:00
if ( IS_EXT2_SB ( sb ) ) {
if ( ext2_feature_set_ok ( sb ) )
ext4_msg ( sb , KERN_INFO , " mounting ext2 file system "
" using the ext4 subsystem " ) ;
else {
ext4_msg ( sb , KERN_ERR , " couldn't mount as ext2 due "
" to feature incompatibilities " ) ;
goto failed_mount ;
}
}
if ( IS_EXT3_SB ( sb ) ) {
if ( ext3_feature_set_ok ( sb ) )
ext4_msg ( sb , KERN_INFO , " mounting ext3 file system "
" using the ext4 subsystem " ) ;
else {
ext4_msg ( sb , KERN_ERR , " couldn't mount as ext3 due "
" to feature incompatibilities " ) ;
goto failed_mount ;
}
}
2006-10-11 12:20:50 +04:00
/*
* Check feature flags regardless of the revision level , since we
* previously didn ' t change the revision level when setting the flags ,
* so there is a chance incompat flags are set on a rev 0 filesystem .
*/
2009-08-18 08:20:23 +04:00
if ( ! ext4_feature_set_ok ( sb , ( sb - > s_flags & MS_RDONLY ) ) )
2006-10-11 12:20:50 +04:00
goto failed_mount ;
2009-08-18 08:20:23 +04:00
2012-12-20 09:07:18 +04:00
blocksize = BLOCK_SIZE < < le32_to_cpu ( es - > s_log_block_size ) ;
2006-10-11 12:20:53 +04:00
if ( blocksize < EXT4_MIN_BLOCK_SIZE | |
blocksize > EXT4_MAX_BLOCK_SIZE ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR ,
2016-11-18 21:00:24 +03:00
" Unsupported filesystem blocksize %d (%d log_block_size) " ,
blocksize , le32_to_cpu ( es - > s_log_block_size ) ) ;
goto failed_mount ;
}
if ( le32_to_cpu ( es - > s_log_block_size ) >
( EXT4_MAX_BLOCK_LOG_SIZE - EXT4_MIN_BLOCK_LOG_SIZE ) ) {
ext4_msg ( sb , KERN_ERR ,
" Invalid log block size: %u " ,
le32_to_cpu ( es - > s_log_block_size ) ) ;
2006-10-11 12:20:50 +04:00
goto failed_mount ;
}
2016-07-06 03:01:52 +03:00
if ( le16_to_cpu ( sbi - > s_es - > s_reserved_gdt_blocks ) > ( blocksize / 4 ) ) {
ext4_msg ( sb , KERN_ERR ,
" Number of reserved GDT blocks insanely large: %d " ,
le16_to_cpu ( sbi - > s_es - > s_reserved_gdt_blocks ) ) ;
goto failed_mount ;
}
2015-02-17 02:59:38 +03:00
if ( sbi - > s_mount_opt & EXT4_MOUNT_DAX ) {
2016-05-10 19:23:54 +03:00
err = bdev_dax_supported ( sb , blocksize ) ;
if ( err )
2015-02-17 02:59:38 +03:00
goto failed_mount ;
}
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_encrypt ( sb ) & & es - > s_encryption_level ) {
2015-04-16 08:56:00 +03:00
ext4_msg ( sb , KERN_ERR , " Unsupported encryption level %d " ,
es - > s_encryption_level ) ;
goto failed_mount ;
}
2006-10-11 12:20:50 +04:00
if ( sb - > s_blocksize ! = blocksize ) {
2008-01-29 07:58:27 +03:00
/* Validate the filesystem blocksize */
if ( ! sb_set_blocksize ( sb , blocksize ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " bad block size %d " ,
2008-01-29 07:58:27 +03:00
blocksize ) ;
2006-10-11 12:20:50 +04:00
goto failed_mount ;
}
2008-07-27 00:15:44 +04:00
brelse ( bh ) ;
2006-10-11 12:21:20 +04:00
logical_sb_block = sb_block * EXT4_MIN_BLOCK_SIZE ;
offset = do_div ( logical_sb_block , blocksize ) ;
2014-09-05 06:36:15 +04:00
bh = sb_bread_unmovable ( sb , logical_sb_block ) ;
2006-10-11 12:20:50 +04:00
if ( ! bh ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR ,
" Can't read superblock on 2nd try " ) ;
2006-10-11 12:20:50 +04:00
goto failed_mount ;
}
2012-05-29 01:47:52 +04:00
es = ( struct ext4_super_block * ) ( bh - > b_data + offset ) ;
2006-10-11 12:20:50 +04:00
sbi - > s_es = es ;
2006-10-11 12:20:53 +04:00
if ( es - > s_magic ! = cpu_to_le16 ( EXT4_SUPER_MAGIC ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR ,
" Magic mismatch, very weird! " ) ;
2006-10-11 12:20:50 +04:00
goto failed_mount ;
}
}
2015-10-17 23:18:43 +03:00
has_huge_files = ext4_has_feature_huge_file ( sb ) ;
2008-10-17 06:50:48 +04:00
sbi - > s_bitmap_maxbytes = ext4_max_bitmap_size ( sb - > s_blocksize_bits ,
has_huge_files ) ;
sb - > s_maxbytes = ext4_max_size ( sb - > s_blocksize_bits , has_huge_files ) ;
2006-10-11 12:20:50 +04:00
2006-10-11 12:20:53 +04:00
if ( le32_to_cpu ( es - > s_rev_level ) = = EXT4_GOOD_OLD_REV ) {
sbi - > s_inode_size = EXT4_GOOD_OLD_INODE_SIZE ;
sbi - > s_first_ino = EXT4_GOOD_OLD_FIRST_INO ;
2006-10-11 12:20:50 +04:00
} else {
sbi - > s_inode_size = le16_to_cpu ( es - > s_inode_size ) ;
sbi - > s_first_ino = le32_to_cpu ( es - > s_first_ino ) ;
2006-10-11 12:20:53 +04:00
if ( ( sbi - > s_inode_size < EXT4_GOOD_OLD_INODE_SIZE ) | |
2007-07-18 17:11:02 +04:00
( ! is_power_of_2 ( sbi - > s_inode_size ) ) | |
2006-10-11 12:20:50 +04:00
( sbi - > s_inode_size > blocksize ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR ,
" unsupported inode size: %d " ,
2008-07-27 00:15:44 +04:00
sbi - > s_inode_size ) ;
2006-10-11 12:20:50 +04:00
goto failed_mount ;
}
2007-07-18 17:15:20 +04:00
if ( sbi - > s_inode_size > EXT4_GOOD_OLD_INODE_SIZE )
sb - > s_time_gran = 1 < < ( EXT4_EPOCH_BITS - 2 ) ;
2006-10-11 12:20:50 +04:00
}
2009-06-04 01:59:28 +04:00
2006-10-11 12:21:14 +04:00
sbi - > s_desc_size = le16_to_cpu ( es - > s_desc_size ) ;
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_64bit ( sb ) ) {
2006-10-11 12:21:15 +04:00
if ( sbi - > s_desc_size < EXT4_MIN_DESC_SIZE_64BIT | |
2006-10-11 12:21:14 +04:00
sbi - > s_desc_size > EXT4_MAX_DESC_SIZE | |
2007-10-17 10:27:14 +04:00
! is_power_of_2 ( sbi - > s_desc_size ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR ,
" unsupported descriptor size %lu " ,
2006-10-11 12:21:14 +04:00
sbi - > s_desc_size ) ;
goto failed_mount ;
}
} else
sbi - > s_desc_size = EXT4_MIN_DESC_SIZE ;
2009-06-04 01:59:28 +04:00
2006-10-11 12:20:50 +04:00
sbi - > s_blocks_per_group = le32_to_cpu ( es - > s_blocks_per_group ) ;
sbi - > s_inodes_per_group = le32_to_cpu ( es - > s_inodes_per_group ) ;
2009-06-04 01:59:28 +04:00
2006-10-11 12:20:53 +04:00
sbi - > s_inodes_per_block = blocksize / EXT4_INODE_SIZE ( sb ) ;
2006-10-11 12:20:50 +04:00
if ( sbi - > s_inodes_per_block = = 0 )
2006-10-11 12:20:53 +04:00
goto cantfind_ext4 ;
2016-11-18 21:28:30 +03:00
if ( sbi - > s_inodes_per_group < sbi - > s_inodes_per_block | |
sbi - > s_inodes_per_group > blocksize * 8 ) {
ext4_msg ( sb , KERN_ERR , " invalid inodes per group: %lu \n " ,
sbi - > s_inodes_per_group ) ;
goto failed_mount ;
}
2006-10-11 12:20:50 +04:00
sbi - > s_itb_per_group = sbi - > s_inodes_per_group /
sbi - > s_inodes_per_block ;
2006-10-11 12:21:14 +04:00
sbi - > s_desc_per_block = blocksize / EXT4_DESC_SIZE ( sb ) ;
2006-10-11 12:20:50 +04:00
sbi - > s_sbh = bh ;
sbi - > s_mount_state = le16_to_cpu ( es - > s_state ) ;
2007-10-17 10:26:25 +04:00
sbi - > s_addr_per_block_bits = ilog2 ( EXT4_ADDR_PER_BLOCK ( sb ) ) ;
sbi - > s_desc_per_block_bits = ilog2 ( EXT4_DESC_PER_BLOCK ( sb ) ) ;
2009-06-04 01:59:28 +04:00
2008-07-27 00:15:44 +04:00
for ( i = 0 ; i < 4 ; i + + )
2006-10-11 12:20:50 +04:00
sbi - > s_hash_seed [ i ] = le32_to_cpu ( es - > s_hash_seed [ i ] ) ;
sbi - > s_def_hash_version = es - > s_def_hash_version ;
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_dir_index ( sb ) ) {
2014-02-12 21:16:04 +04:00
i = le32_to_cpu ( es - > s_flags ) ;
if ( i & EXT2_FLAGS_UNSIGNED_HASH )
sbi - > s_hash_unsigned = 3 ;
else if ( ( i & EXT2_FLAGS_SIGNED_HASH ) = = 0 ) {
2008-10-28 20:21:44 +03:00
# ifdef __CHAR_UNSIGNED__
2014-02-12 21:16:04 +04:00
if ( ! ( sb - > s_flags & MS_RDONLY ) )
es - > s_flags | =
cpu_to_le32 ( EXT2_FLAGS_UNSIGNED_HASH ) ;
sbi - > s_hash_unsigned = 3 ;
2008-10-28 20:21:44 +03:00
# else
2014-02-12 21:16:04 +04:00
if ( ! ( sb - > s_flags & MS_RDONLY ) )
es - > s_flags | =
cpu_to_le32 ( EXT2_FLAGS_SIGNED_HASH ) ;
2008-10-28 20:21:44 +03:00
# endif
2014-02-12 21:16:04 +04:00
}
2008-10-28 20:21:44 +03:00
}
2006-10-11 12:20:50 +04:00
2011-09-10 02:34:51 +04:00
/* Handle clustersize */
clustersize = BLOCK_SIZE < < le32_to_cpu ( es - > s_log_cluster_size ) ;
2015-10-17 23:18:43 +03:00
has_bigalloc = ext4_has_feature_bigalloc ( sb ) ;
2011-09-10 02:34:51 +04:00
if ( has_bigalloc ) {
if ( clustersize < blocksize ) {
ext4_msg ( sb , KERN_ERR ,
" cluster size (%d) smaller than "
" block size (%d) " , clustersize , blocksize ) ;
goto failed_mount ;
}
2016-11-18 21:00:24 +03:00
if ( le32_to_cpu ( es - > s_log_cluster_size ) >
( EXT4_MAX_CLUSTER_LOG_SIZE - EXT4_MIN_BLOCK_LOG_SIZE ) ) {
ext4_msg ( sb , KERN_ERR ,
" Invalid log cluster size: %u " ,
le32_to_cpu ( es - > s_log_cluster_size ) ) ;
goto failed_mount ;
}
2011-09-10 02:34:51 +04:00
sbi - > s_cluster_bits = le32_to_cpu ( es - > s_log_cluster_size ) -
le32_to_cpu ( es - > s_log_block_size ) ;
sbi - > s_clusters_per_group =
le32_to_cpu ( es - > s_clusters_per_group ) ;
if ( sbi - > s_clusters_per_group > blocksize * 8 ) {
ext4_msg ( sb , KERN_ERR ,
" #clusters per group too big: %lu " ,
sbi - > s_clusters_per_group ) ;
goto failed_mount ;
}
if ( sbi - > s_blocks_per_group ! =
( sbi - > s_clusters_per_group * ( clustersize / blocksize ) ) ) {
ext4_msg ( sb , KERN_ERR , " blocks per group (%lu) and "
" clusters per group (%lu) inconsistent " ,
sbi - > s_blocks_per_group ,
sbi - > s_clusters_per_group ) ;
goto failed_mount ;
}
} else {
if ( clustersize ! = blocksize ) {
ext4_warning ( sb , " fragment/cluster size (%d) != "
" block size (%d) " , clustersize ,
blocksize ) ;
clustersize = blocksize ;
}
if ( sbi - > s_blocks_per_group > blocksize * 8 ) {
ext4_msg ( sb , KERN_ERR ,
" #blocks per group too big: %lu " ,
sbi - > s_blocks_per_group ) ;
goto failed_mount ;
}
sbi - > s_clusters_per_group = sbi - > s_blocks_per_group ;
sbi - > s_cluster_bits = 0 ;
2006-10-11 12:20:50 +04:00
}
2011-09-10 02:34:51 +04:00
sbi - > s_cluster_ratio = clustersize / blocksize ;
2013-07-06 07:11:16 +04:00
/* Do we have standard group size of clustersize * 8 blocks ? */
if ( sbi - > s_blocks_per_group = = clustersize < < 3 )
set_opt2 ( sb , STD_GROUP_SIZE ) ;
2009-08-18 07:48:51 +04:00
/*
* Test whether we have more sectors than will fit in sector_t ,
* and whether the max offset is addressable by the page cache .
*/
2010-11-19 17:56:44 +03:00
err = generic_check_addressable ( sb - > s_blocksize_bits ,
2010-07-23 02:03:41 +04:00
ext4_blocks_count ( es ) ) ;
2010-11-19 17:56:44 +03:00
if ( err ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " filesystem "
2009-08-18 07:48:51 +04:00
" too large to mount safely on this system " ) ;
2006-10-11 12:20:50 +04:00
if ( sizeof ( sector_t ) < 8 )
2009-06-19 10:08:50 +04:00
ext4_msg ( sb , KERN_WARNING , " CONFIG_LBDAF not enabled " ) ;
2006-10-11 12:20:50 +04:00
goto failed_mount ;
}
2006-10-11 12:20:53 +04:00
if ( EXT4_BLOCKS_PER_GROUP ( sb ) = = 0 )
goto cantfind_ext4 ;
ext4: fix oops on corrupted ext4 mount
When mounting an ext4 filesystem with corrupted s_first_data_block, things
can go very wrong and oops.
Because blocks_count in ext4_fill_super is a u64, and we must use do_div,
the calculation of db_count is done differently than on ext4. If
first_data_block is corrupted such that it is larger than ext4_blocks_count,
for example, then the intermediate blocks_count value may go negative,
but sign-extend to a very large value:
blocks_count = (ext4_blocks_count(es) -
le32_to_cpu(es->s_first_data_block) +
EXT4_BLOCKS_PER_GROUP(sb) - 1);
This is then assigned to s_groups_count which is an unsigned long:
sbi->s_groups_count = blocks_count;
This may result in a value of 0xFFFFFFFF which is then used to compute
db_count:
db_count = (sbi->s_groups_count + EXT4_DESC_PER_BLOCK(sb) - 1) /
EXT4_DESC_PER_BLOCK(sb);
and in this case db_count will wind up as 0 because the addition overflows
32 bits. This in turn causes the kmalloc for group_desc to be of 0 size:
sbi->s_group_desc = kmalloc(db_count * sizeof (struct buffer_head *),
GFP_KERNEL);
and eventually in ext4_check_descriptors, dereferencing
sbi->s_group_desc[desc_block] will result in a NULL pointer dereference.
The simplest test seems to be to sanity check s_first_data_block,
EXT4_BLOCKS_PER_GROUP, and ext4_blocks_count values to be sure
their combination won't result in a bad intermediate value for
blocks_count. We could just check for db_count == 0, but
catching it at the root cause seems like it provides more info.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
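The wrap-around described in this commit message can be reproduced outside the kernel. The following is a minimal sketch, not kernel code: the helper names db_count_32, db_count_64 and groups_count_ok are hypothetical, and the 32-bit variant assumes an unsigned long that is 32 bits wide, as on the affected systems.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: descriptor-block count computed in 32-bit
 * arithmetic, mirroring the expression quoted in the commit message. */
static uint32_t db_count_32(uint32_t groups_count, uint32_t desc_per_block)
{
	assert(desc_per_block != 0);
	/* The addition wraps modulo 2^32 when groups_count is near UINT32_MAX. */
	return (groups_count + desc_per_block - 1) / desc_per_block;
}

/* Hypothetical helper: the same calculation widened to 64 bits. */
static uint64_t db_count_64(uint64_t groups_count, uint32_t desc_per_block)
{
	assert(desc_per_block != 0);
	return (groups_count + desc_per_block - 1) / desc_per_block;
}

/* Hypothetical helper: upper-bound test in the style of the mount-time
 * check, rejecting group counts whose rounding-up would overflow 32 bits. */
static int groups_count_ok(uint64_t groups_count, uint32_t desc_per_block)
{
	return groups_count <= ((uint64_t)1 << 32) - desc_per_block;
}
```

With a corrupted group count of 0xFFFFFFFF and 32 descriptors per block, the 32-bit form yields 0 while the widened form yields 134217728, which is why the bad geometry has to be rejected before db_count is computed.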
2008-01-29 07:58:27 +03:00
2009-04-07 22:07:47 +04:00
/* check blocks count against device size */
blocks_count = sb - > s_bdev - > bd_inode - > i_size > > sb - > s_blocksize_bits ;
if ( blocks_count & & ext4_blocks_count ( es ) > blocks_count ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_WARNING , " bad geometry: block count %llu "
" exceeds size of device (%llu blocks) " ,
2009-04-07 22:07:47 +04:00
ext4_blocks_count ( es ) , blocks_count ) ;
goto failed_mount ;
}
2009-06-04 01:59:28 +04:00
/*
* It makes no sense for the first data block to be beyond the end
* of the filesystem .
*/
if ( le32_to_cpu ( es - > s_first_data_block ) > = ext4_blocks_count ( es ) ) {
2011-12-19 01:13:58 +04:00
ext4_msg ( sb , KERN_WARNING , " bad geometry: first data "
2009-06-05 01:36:36 +04:00
" block %u is beyond end of filesystem (%llu) " ,
le32_to_cpu ( es - > s_first_data_block ) ,
ext4_blocks_count ( es ) ) ;
2008-01-29 07:58:27 +03:00
goto failed_mount ;
}
2006-10-11 12:21:10 +04:00
blocks_count = ( ext4_blocks_count ( es ) -
le32_to_cpu ( es - > s_first_data_block ) +
EXT4_BLOCKS_PER_GROUP ( sb ) - 1 ) ;
do_div ( blocks_count , EXT4_BLOCKS_PER_GROUP ( sb ) ) ;
2009-01-06 22:53:26 +03:00
if ( blocks_count > ( ( uint64_t ) 1 < < 32 ) - EXT4_DESC_PER_BLOCK ( sb ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_WARNING , " groups count too large: %llu "
2009-01-06 22:53:26 +03:00
" (block count %llu, first data block %u, "
2009-06-05 01:36:36 +04:00
" blocks per group %lu) " , blocks_count ,
2009-01-06 22:53:26 +03:00
ext4_blocks_count ( es ) ,
le32_to_cpu ( es - > s_first_data_block ) ,
EXT4_BLOCKS_PER_GROUP ( sb ) ) ;
goto failed_mount ;
}
2006-10-11 12:21:10 +04:00
sbi - > s_groups_count = blocks_count ;
ext4: limit block allocations for indirect-block files to < 2^32
Today, the ext4 allocator will happily allocate blocks past
2^32 for indirect-block files, which results in the block
numbers getting truncated, and corruption ensues.
This patch limits such allocations to < 2^32, and adds
BUG_ONs if we do get blocks larger than that.
This should address RH Bug 519471, ext4 bitmap allocator
must limit blocks to < 2^32
* ext4_find_goal() is modified to choose a goal < UINT_MAX,
so that our starting point is in an acceptable range.
* ext4_xattr_block_set() is modified such that the goal block
is < UINT_MAX, as above.
* ext4_mb_regular_allocator() is modified so that the group
search does not continue into groups which are too high
* ext4_mb_use_preallocated() has a check that we don't use
preallocated space which is too far out
* ext4_alloc_blocks() and ext4_xattr_block_set() add some BUG_ONs
No attempt has been made to limit inode locations to < 2^32,
so we may wind up with blocks far from their inodes. Doing
this much already will lead to some odd ENOSPC issues when the
"lower 32" gets full, and further restricting inodes could
make that even weirder.
For high inodes, choosing a goal of the original, % UINT_MAX,
may be a bit odd, but then we're in an odd situation anyway,
and I don't know of a better heuristic.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
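The group limit that follows from this restriction can be sketched as below. This is an illustration under stated assumptions, not the kernel implementation: MAX_BLOCK_FILE_PHYS is assumed to mirror EXT4_MAX_BLOCK_FILE_PHYS (0xFFFFFFFF), and blockfile_groups is a hypothetical stand-in for the min_t() expression that sets s_blockfile_groups.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror EXT4_MAX_BLOCK_FILE_PHYS: the highest physical block
 * number an indirect-block (non-extent) file can address. */
#define MAX_BLOCK_FILE_PHYS 0xFFFFFFFFu

/* Hypothetical helper: number of block groups that lie entirely below
 * the 2^32 block boundary, capped at the real group count. */
static uint32_t blockfile_groups(uint32_t groups_count,
				 uint32_t blocks_per_group)
{
	uint32_t limit;

	assert(blocks_per_group != 0);
	limit = MAX_BLOCK_FILE_PHYS / blocks_per_group;
	return groups_count < limit ? groups_count : limit;
}
```

For 32768 blocks per group the cap is 131071 groups, so a filesystem with 200000 groups would expose only the first 131071 to indirect-block files, while a filesystem with 1000 groups is unaffected.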
2009-09-16 22:45:10 +04:00
sbi - > s_blockfile_groups = min_t ( ext4_group_t , sbi - > s_groups_count ,
( EXT4_MAX_BLOCK_FILE_PHYS / EXT4_BLOCKS_PER_GROUP ( sb ) ) ) ;
2006-10-11 12:20:53 +04:00
db_count = ( sbi - > s_groups_count + EXT4_DESC_PER_BLOCK ( sb ) - 1 ) /
EXT4_DESC_PER_BLOCK ( sb ) ;
ext4: validate s_first_meta_bg at mount time
Ralf Spenneberg reported a kernel crash when mounting a
modified ext4 image. The kernel crashed while calculating the
filesystem overhead (ext4_calculate_overhead()) because the image
has a very large s_first_meta_bg (debug code shows it is
842150400), so ext4 overruns the PAGE_SIZE bitmap buffer in
count_overhead() when setting bits.
ext4_calculate_overhead():
buf = get_zeroed_page(GFP_NOFS); <=== PAGE_SIZE buffer
blks = count_overhead(sb, i, buf);
count_overhead():
for (j = ext4_bg_num_gdb(sb, grp); j > 0; j--) { <=== j = 842150400
ext4_set_bit(EXT4_B2C(sbi, s++), buf); <=== buffer overrun
count++;
}
This can be reproduced easily for me by this script:
#!/bin/bash
rm -f fs.img
mkdir -p /mnt/ext4
fallocate -l 16M fs.img
mke2fs -t ext4 -O bigalloc,meta_bg,^resize_inode -F fs.img
debugfs -w -R "ssv first_meta_bg 842150400" fs.img
mount -o loop fs.img /mnt/ext4
Fix it by validating s_first_meta_bg first at mount time, and
refusing to mount if its value exceeds the largest possible meta_bg
number.
Reported-by: Ralf Spenneberg <ralf@os-t.de>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
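The validation described above amounts to comparing s_first_meta_bg with the number of group descriptor blocks. A minimal sketch follows, assuming hypothetical helper names (desc_blocks, first_meta_bg_ok) and plain integers in place of the superblock fields:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of descriptor blocks needed for
 * groups_count groups, rounding up. */
static uint32_t desc_blocks(uint32_t groups_count, uint32_t desc_per_block)
{
	assert(desc_per_block != 0);
	return (groups_count + desc_per_block - 1) / desc_per_block;
}

/* Hypothetical helper: accept first_meta_bg only if it does not exceed
 * the descriptor-block count, in the spirit of the mount-time check. */
static int first_meta_bg_ok(uint32_t first_meta_bg, uint32_t db_count)
{
	return first_meta_bg <= db_count;
}
```

With a single descriptor block, the value 842150400 from the reproducer is rejected, whereas 0 or 1 would be accepted.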
2016-12-01 23:08:37 +03:00
if ( ext4_has_feature_meta_bg ( sb ) ) {
2017-02-15 09:26:39 +03:00
if ( le32_to_cpu ( es - > s_first_meta_bg ) > db_count ) {
2016-12-01 23:08:37 +03:00
ext4_msg ( sb , KERN_WARNING ,
" first meta block group too large: %u "
" (group descriptor block count %u) " ,
le32_to_cpu ( es - > s_first_meta_bg ) , db_count ) ;
goto failed_mount ;
}
}
2017-05-09 01:57:09 +03:00
sbi - > s_group_desc = kvmalloc ( db_count *
2011-08-01 16:45:38 +04:00
sizeof ( struct buffer_head * ) ,
GFP_KERNEL ) ;
2006-10-11 12:20:50 +04:00
if ( sbi - > s_group_desc = = NULL ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " not enough memory " ) ;
2012-05-29 01:49:54 +04:00
ret = - ENOMEM ;
2006-10-11 12:20:50 +04:00
goto failed_mount ;
}
2009-02-16 02:07:52 +03:00
bgl_lock_init ( sbi - > s_blockgroup_lock ) ;
2006-10-11 12:20:50 +04:00
2017-04-30 07:46:35 +03:00
/* Pre-read the descriptors into the buffer cache */
for ( i = 0 ; i < db_count ; i + + ) {
block = descriptor_loc ( sb , logical_sb_block , i ) ;
sb_breadahead ( sb , block ) ;
}
2006-10-11 12:20:50 +04:00
for ( i = 0 ; i < db_count ; i + + ) {
2006-10-11 12:21:20 +04:00
block = descriptor_loc ( sb , logical_sb_block , i ) ;
2014-09-05 06:36:15 +04:00
sbi - > s_group_desc [ i ] = sb_bread_unmovable ( sb , block ) ;
2006-10-11 12:20:50 +04:00
if ( ! sbi - > s_group_desc [ i ] ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR ,
" can't read group descriptor %d " , i ) ;
2006-10-11 12:20:50 +04:00
db_count = i ;
goto failed_mount2 ;
}
}
2016-08-01 07:51:02 +03:00
if ( ! ext4_check_descriptors ( sb , logical_sb_block , & first_not_zeroed ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " group descriptors corrupted! " ) ;
2015-10-17 23:16:04 +03:00
ret = - EFSCORRUPTED ;
2014-07-11 21:55:40 +04:00
goto failed_mount2 ;
2006-10-11 12:20:50 +04:00
}
2008-07-12 03:27:31 +04:00
2014-07-11 21:55:40 +04:00
sbi - > s_gdb_count = db_count ;
2006-10-11 12:20:50 +04:00
get_random_bytes ( & sbi - > s_next_generation , sizeof ( u32 ) ) ;
spin_lock_init ( & sbi - > s_next_gen_lock ) ;
2015-01-26 22:42:31 +03:00
setup_timer ( & sbi - > s_err_report , print_daily_error_info ,
( unsigned long ) sb ) ;
2011-04-06 03:55:28 +04:00
2013-04-04 06:10:52 +04:00
/* Register extent status tree shrinker */
2014-09-02 06:26:49 +04:00
if ( ext4_es_register_shrinker ( sbi ) )
2010-11-03 19:03:21 +03:00
goto failed_mount3 ;
2008-01-29 08:19:52 +03:00
sbi - > s_stripe = ext4_get_stripe_size ( sbi ) ;
2012-08-17 17:54:17 +04:00
sbi - > s_extent_max_zeroout_kb = 32 ;
2008-01-29 08:19:52 +03:00
2014-07-11 21:55:40 +04:00
/*
* set up enough so that it can read an inode
*/
2014-09-19 01:12:30 +04:00
sb - > s_op = & ext4_sops ;
2006-10-11 12:20:53 +04:00
sb - > s_export_op = & ext4_export_ops ;
sb - > s_xattr = ext4_xattr_handlers ;
2016-07-10 21:01:03 +03:00
sb - > s_cop = & ext4_cryptops ;
2006-10-11 12:20:50 +04:00
# ifdef CONFIG_QUOTA
2006-10-11 12:20:53 +04:00
sb - > dq_op = & ext4_quota_operations ;
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_quota ( sb ) )
2014-10-08 20:26:54 +04:00
sb - > s_qcop = & dquot_quotactl_sysfile_ops ;
2013-03-03 02:57:08 +04:00
else
sb - > s_qcop = & ext4_qctl_operations ;
2016-01-09 00:01:22 +03:00
sb - > s_quota_types = QTYPE_MASK_USR | QTYPE_MASK_GRP | QTYPE_MASK_PRJ ;
2006-10-11 12:20:50 +04:00
# endif
2017-05-10 16:06:33 +03:00
memcpy ( & sb - > s_uuid , es - > s_uuid , sizeof ( es - > s_uuid ) ) ;
2011-01-29 16:13:40 +03:00
2006-10-11 12:20:50 +04:00
INIT_LIST_HEAD ( & sbi - > s_orphan ) ; /* unlinked but open files */
2009-04-26 06:54:04 +04:00
mutex_init ( & sbi - > s_orphan_lock ) ;
2006-10-11 12:20:50 +04:00
sb - > s_root = NULL ;
needs_recovery = ( es - > s_last_orphan ! = 0 | |
2015-10-17 23:18:43 +03:00
ext4_has_feature_journal_needs_recovery ( sb ) ) ;
2006-10-11 12:20:50 +04:00
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_mmp ( sb ) & & ! ( sb - > s_flags & MS_RDONLY ) )
2011-05-25 02:31:25 +04:00
if ( ext4_multi_mount_protect ( sb , le64_to_cpu ( es - > s_mmp_block ) ) )
2014-10-30 17:53:16 +03:00
goto failed_mount3a ;
2011-05-25 02:31:25 +04:00
2006-10-11 12:20:50 +04:00
/*
* The first inode we look at is the journal inode . Don ' t try
* root first : it may be modified in the journal !
*/
2015-10-17 23:18:43 +03:00
if ( ! test_opt ( sb , NOLOAD ) & & ext4_has_feature_journal ( sb ) ) {
2017-02-05 09:26:48 +03:00
err = ext4_load_journal ( sb , es , journal_devnum ) ;
if ( err )
2014-10-30 17:53:16 +03:00
goto failed_mount3a ;
2009-01-07 08:06:22 +03:00
} else if ( test_opt ( sb , NOLOAD ) & & ! ( sb - > s_flags & MS_RDONLY ) & &
2015-10-17 23:18:43 +03:00
ext4_has_feature_journal_needs_recovery ( sb ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " required journal recovery "
" suppressed and not mounted read-only " ) ;
2010-03-05 00:14:02 +03:00
goto failed_mount_wq ;
2006-10-11 12:20:50 +04:00
} else {
ext4: do not allow journal_opts for fs w/o journal
It turns out that journal-related mount options can be passed for a filesystem
without a journal, and such options are then shown in /proc/mounts
Example:
#mkfs.ext4 -F /dev/vdb
#tune2fs -O ^has_journal /dev/vdb
#mount /dev/vdb /mnt/ -ocommit=20,journal_async_commit
#cat /proc/mounts | grep /mnt
/dev/vdb /mnt ext4 rw,relatime,journal_checksum,journal_async_commit,commit=20,data=ordered 0 0
But the options "journal_checksum,journal_async_commit,commit=20,data=ordered" have
nothing to do with reality because there is no journal at all.
This patch disallows the following options for journalless configurations:
- journal_checksum
- journal_async_commit
- commit=%ld
- data={writeback,ordered,journal}
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
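One way to detect an explicitly requested data= mode is to XOR the effective mount options with the defaults and mask the result with the data-mode bits. The sketch below illustrates that idea only; the macro values are assumptions meant to resemble the EXT4_MOUNT_*_DATA bits, and data_mode_overridden is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit values, for illustration only. */
#define MOUNT_JOURNAL_DATA   0x00400u
#define MOUNT_ORDERED_DATA   0x00800u
#define MOUNT_WRITEBACK_DATA 0x00C00u
#define MOUNT_DATA_FLAGS     0x00C00u	/* mask covering all data modes */

/* Hypothetical helper: nonzero when the data-mode bits of the effective
 * options differ from the defaults, i.e. the user passed data=... */
static int data_mode_overridden(uint32_t mount_opt, uint32_t def_mount_opt)
{
	/* XOR leaves only the bits that changed; the mask keeps data bits. */
	return (MOUNT_DATA_FLAGS & (mount_opt ^ def_mount_opt)) != 0;
}
```

Differences in unrelated option bits do not trigger the check, so only a changed data mode causes the journalless mount to be refused.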
2015-10-19 06:50:26 +03:00
/* Nojournal mode, all journal mount options are illegal */
if ( test_opt2 ( sb , EXPLICIT_JOURNAL_CHECKSUM ) ) {
ext4_msg ( sb , KERN_ERR , " can't mount with "
" journal_checksum, fs mounted w/o journal " ) ;
goto failed_mount_wq ;
}
if ( test_opt ( sb , JOURNAL_ASYNC_COMMIT ) ) {
ext4_msg ( sb , KERN_ERR , " can't mount with "
" journal_async_commit, fs mounted w/o journal " ) ;
goto failed_mount_wq ;
}
if ( sbi - > s_commit_interval ! = JBD2_DEFAULT_MAX_COMMIT_AGE * HZ ) {
ext4_msg ( sb , KERN_ERR , " can't mount with "
" commit=%lu, fs mounted w/o journal " ,
sbi - > s_commit_interval / HZ ) ;
goto failed_mount_wq ;
}
if ( EXT4_MOUNT_DATA_FLAGS &
( sbi - > s_mount_opt ^ sbi - > s_def_mount_opt ) ) {
ext4_msg ( sb , KERN_ERR , " can't mount with "
" data=, fs mounted w/o journal " ) ;
goto failed_mount_wq ;
}
sbi - > s_def_mount_opt & = ~ EXT4_MOUNT_JOURNAL_CHECKSUM ;
clear_opt ( sb , JOURNAL_CHECKSUM ) ;
2010-12-16 04:26:48 +03:00
clear_opt ( sb , DATA_FLAGS ) ;
2009-01-07 08:06:22 +03:00
sbi - > s_journal = NULL ;
needs_recovery = 0 ;
goto no_journal ;
2006-10-11 12:20:50 +04:00
}
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_64bit ( sb ) & &
2007-07-18 16:37:25 +04:00
! jbd2_journal_set_features ( EXT4_SB ( sb ) - > s_journal , 0 , 0 ,
JBD2_FEATURE_INCOMPAT_64BIT ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " Failed to set 64-bit journal feature " ) ;
2010-03-05 00:14:02 +03:00
goto failed_mount_wq ;
2007-07-18 16:37:25 +04:00
}
2012-05-27 15:48:56 +04:00
if ( ! set_journal_csum_feature_set ( sb ) ) {
ext4_msg ( sb , KERN_ERR , " Failed to set journal checksum "
" feature set " ) ;
goto failed_mount_wq ;
2009-11-02 21:15:27 +03:00
}
2008-01-29 07:58:27 +03:00
2006-10-11 12:20:50 +04:00
/* We have now updated the journal if required, so we can
* validate the data journaling mode . */
switch ( test_opt ( sb , DATA_FLAGS ) ) {
case 0 :
/* No mode set, assume a default based on the journal
2006-10-11 12:21:24 +04:00
* capabilities : ORDERED_DATA if the journal can
* cope , else JOURNAL_DATA
*/
2006-10-11 12:21:01 +04:00
if ( jbd2_journal_check_available_features
( sbi - > s_journal , 0 , 0 , JBD2_FEATURE_INCOMPAT_REVOKE ) )
2010-12-16 04:26:48 +03:00
set_opt ( sb , ORDERED_DATA ) ;
2006-10-11 12:20:50 +04:00
else
2010-12-16 04:26:48 +03:00
set_opt ( sb , JOURNAL_DATA ) ;
2006-10-11 12:20:50 +04:00
break ;
2006-10-11 12:20:53 +04:00
case EXT4_MOUNT_ORDERED_DATA :
case EXT4_MOUNT_WRITEBACK_DATA :
2006-10-11 12:21:01 +04:00
if ( ! jbd2_journal_check_available_features
( sbi - > s_journal , 0 , 0 , JBD2_FEATURE_INCOMPAT_REVOKE ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " Journal does not support "
" requested data journaling mode " ) ;
2010-03-05 00:14:02 +03:00
goto failed_mount_wq ;
2006-10-11 12:20:50 +04:00
}
default :
break ;
}
2016-12-04 00:20:53 +03:00
if ( test_opt ( sb , DATA_FLAGS ) = = EXT4_MOUNT_ORDERED_DATA & &
test_opt ( sb , JOURNAL_ASYNC_COMMIT ) ) {
ext4_msg ( sb , KERN_ERR , " can't mount with "
" journal_async_commit in data=ordered mode " ) ;
goto failed_mount_wq ;
}
2009-01-06 06:46:26 +03:00
set_task_ioprio ( sbi - > s_journal - > j_task , journal_ioprio ) ;
2006-10-11 12:20:50 +04:00
2012-02-21 02:53:02 +04:00
sbi - > s_journal - > j_commit_callback = ext4_journal_commit_callback ;
2010-11-03 19:03:21 +03:00
no_journal :
2017-06-22 18:55:14 +03:00
if ( ! test_opt ( sb , NO_MBCACHE ) ) {
sbi - > s_ea_block_cache = ext4_xattr_create_cache ( ) ;
if ( ! sbi - > s_ea_block_cache ) {
2017-06-22 18:44:55 +03:00
ext4_msg ( sb , KERN_ERR ,
2017-06-22 18:55:14 +03:00
" Failed to create ea_block_cache " ) ;
2017-06-22 18:44:55 +03:00
goto failed_mount_wq ;
}
2017-06-22 18:55:14 +03:00
if ( ext4_has_feature_ea_inode ( sb ) ) {
sbi - > s_ea_inode_cache = ext4_xattr_create_cache ( ) ;
if ( ! sbi - > s_ea_inode_cache ) {
ext4_msg ( sb , KERN_ERR ,
" Failed to create ea_inode_cache " ) ;
goto failed_mount_wq ;
}
}
2014-03-19 03:24:49 +04:00
}
2015-10-17 23:18:43 +03:00
if ( ( DUMMY_ENCRYPTION_ENABLED ( sbi ) | | ext4_has_feature_encrypt ( sb ) ) & &
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it would be possible to implement the page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized, and it is unlikely that it ever will.
We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE. It is a constant source of confusion whether a
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using the
script below. For some reason, coccinelle doesn't patch header files,
so I've called spatch for them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code that coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 15:29:47 +03:00
( blocksize ! = PAGE_SIZE ) ) {
2015-06-13 06:44:33 +03:00
ext4_msg ( sb , KERN_ERR ,
" Unsupported blocksize for fs encryption " ) ;
goto failed_mount_wq ;
}
2015-10-17 23:18:43 +03:00
if ( DUMMY_ENCRYPTION_ENABLED ( sbi ) & & ! ( sb - > s_flags & MS_RDONLY ) & &
! ext4_has_feature_encrypt ( sb ) ) {
ext4_set_feature_encrypt ( sb ) ;
2015-04-16 08:56:00 +03:00
ext4_commit_super ( sb , 1 ) ;
}
2012-07-10 00:27:05 +04:00
/*
* Get the # of file system overhead blocks from the
* superblock if present .
*/
if ( es - > s_overhead_clusters )
sbi - > s_overhead = le32_to_cpu ( es - > s_overhead_clusters ) ;
else {
2012-11-09 00:16:54 +04:00
err = ext4_calculate_overhead ( sb ) ;
if ( err )
2012-07-10 00:27:05 +04:00
goto failed_mount_wq ;
}
2011-02-01 13:42:42 +03:00
/*
* The maximum number of concurrent works can be high and
* concurrency isn ' t really necessary . Limit it to 1.
*/
2013-06-04 22:21:02 +04:00
EXT4_SB ( sb ) - > rsv_conversion_wq =
alloc_workqueue ( " ext4-rsv-conversion " , WQ_MEM_RECLAIM | WQ_UNBOUND , 1 ) ;
if ( ! EXT4_SB ( sb ) - > rsv_conversion_wq ) {
printk ( KERN_ERR " EXT4-fs: failed to create workqueue \n " ) ;
2012-11-09 00:16:54 +04:00
ret = - ENOMEM ;
2013-06-04 22:21:02 +04:00
goto failed_mount4 ;
}
2006-10-11 12:20:50 +04:00
/*
2006-10-11 12:21:01 +04:00
* The jbd2_journal_load will have done any necessary log recovery ,
2006-10-11 12:20:50 +04:00
* so we can safely mount the rest of the filesystem now .
*/
2008-02-07 11:15:37 +03:00
root = ext4_iget ( sb , EXT4_ROOT_INO ) ;
if ( IS_ERR ( root ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " get root inode failed " ) ;
2008-02-07 11:15:37 +03:00
ret = PTR_ERR ( root ) ;
2011-02-28 04:42:06 +03:00
root = NULL ;
2006-10-11 12:20:50 +04:00
goto failed_mount4 ;
}
if ( ! S_ISDIR ( root - > i_mode ) | | ! root - > i_blocks | | ! root - > i_size ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " corrupt root inode, run e2fsck " ) ;
2012-01-10 00:53:24 +04:00
iput ( root ) ;
2006-10-11 12:20:50 +04:00
goto failed_mount4 ;
}
2012-01-09 07:15:13 +04:00
sb - > s_root = d_make_root ( root ) ;
2008-02-07 11:15:37 +03:00
if ( ! sb - > s_root ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " get root dentry failed " ) ;
2008-02-07 11:15:37 +03:00
ret = - ENOMEM ;
goto failed_mount4 ;
}
2006-10-11 12:20:50 +04:00
2012-05-28 22:17:25 +04:00
if ( ext4_setup_super ( sb , es , sb - > s_flags & MS_RDONLY ) )
sb - > s_flags | = MS_RDONLY ;
2007-07-18 17:15:20 +04:00
/* determine the minimum size of new large inodes, if present */
2017-01-11 23:32:22 +03:00
if ( sbi - > s_inode_size > EXT4_GOOD_OLD_INODE_SIZE & &
sbi - > s_want_extra_isize = = 0 ) {
2007-07-18 17:15:20 +04:00
sbi - > s_want_extra_isize = sizeof ( struct ext4_inode ) -
EXT4_GOOD_OLD_INODE_SIZE ;
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_extra_isize ( sb ) ) {
2007-07-18 17:15:20 +04:00
if ( sbi - > s_want_extra_isize <
le16_to_cpu ( es - > s_want_extra_isize ) )
sbi - > s_want_extra_isize =
le16_to_cpu ( es - > s_want_extra_isize ) ;
if ( sbi - > s_want_extra_isize <
le16_to_cpu ( es - > s_min_extra_isize ) )
sbi - > s_want_extra_isize =
le16_to_cpu ( es - > s_min_extra_isize ) ;
}
}
/* Check if enough inode space is available */
if ( EXT4_GOOD_OLD_INODE_SIZE + sbi - > s_want_extra_isize >
sbi - > s_inode_size ) {
sbi - > s_want_extra_isize = sizeof ( struct ext4_inode ) -
EXT4_GOOD_OLD_INODE_SIZE ;
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_INFO , " required extra inode space not "
" available " ) ;
2007-07-18 17:15:20 +04:00
}
2015-09-23 19:44:17 +03:00
ext4_set_resv_clusters ( sb ) ;
2013-04-10 06:11:22 +04:00
2009-05-17 23:38:01 +04:00
err = ext4_setup_system_zone ( sb ) ;
if ( err ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " failed to initialize system "
2010-05-16 21:00:00 +04:00
" zone (%d) " , err ) ;
2014-07-11 21:55:40 +04:00
goto failed_mount4a ;
}
ext4_ext_init ( sb ) ;
err = ext4_mb_init ( sb ) ;
if ( err ) {
ext4_msg ( sb , KERN_ERR , " failed to initialize mballoc (%d) " ,
err ) ;
2011-10-06 20:10:11 +04:00
goto failed_mount5 ;
2008-10-11 04:07:20 +04:00
}
2014-07-15 14:01:38 +04:00
block = ext4_count_free_clusters ( sb ) ;
ext4_free_blocks_count_set ( sbi - > s_es ,
EXT4_C2B ( sbi , block ) ) ;
2014-09-08 04:51:29 +04:00
err = percpu_counter_init ( & sbi - > s_freeclusters_counter , block ,
GFP_KERNEL ) ;
2014-07-15 14:01:38 +04:00
if ( ! err ) {
unsigned long freei = ext4_count_free_inodes ( sb ) ;
sbi - > s_es - > s_free_inodes_count = cpu_to_le32 ( freei ) ;
2014-09-08 04:51:29 +04:00
err = percpu_counter_init ( & sbi - > s_freeinodes_counter , freei ,
GFP_KERNEL ) ;
2014-07-15 14:01:38 +04:00
}
if ( ! err )
err = percpu_counter_init ( & sbi - > s_dirs_counter ,
2014-09-08 04:51:29 +04:00
ext4_count_dirs ( sb ) , GFP_KERNEL ) ;
2014-07-15 14:01:38 +04:00
if ( ! err )
2014-09-08 04:51:29 +04:00
err = percpu_counter_init ( & sbi - > s_dirtyclusters_counter , 0 ,
GFP_KERNEL ) ;
2016-04-26 06:22:35 +03:00
if ( ! err )
err = percpu_init_rwsem ( & sbi - > s_journal_flag_rwsem ) ;
2014-07-15 14:01:38 +04:00
if ( err ) {
ext4_msg ( sb , KERN_ERR , " insufficient memory " ) ;
goto failed_mount6 ;
}
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_flex_bg ( sb ) )
2014-07-15 14:01:38 +04:00
if ( ! ext4_fill_flex_info ( sb ) ) {
ext4_msg ( sb , KERN_ERR ,
" unable to initialize "
" flex_bg meta info! " ) ;
goto failed_mount6 ;
}
ext4: add support for lazy inode table initialization
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out. The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, have not been corrupted. However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the old, uninitialized data could potentially cause e2fsck to
report false problems.
Hence, it is important for the inode tables to be initialized as soon
as possible. This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.
This is done via a new kernel thread called ext4lazyinit, which is
created on demand and destroyed when it is no longer needed. There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with the init_itable mount option is mounted, the ext4lazyinit
thread is created, then the filesystem can register its request in the
request list.
This thread then walks through the list of requests picking up
scheduled requests and invoking ext4_init_inode_table(). The next schedule
time for the request is computed by multiplying the time it took to
zero out the last inode table by the wait multiplier, which can be set with
the (init_itable=n) mount option (default is 10). We are doing
this so we do not take the whole I/O bandwidth. When the thread is no
longer necessary (request list is empty) it frees the appropriate
structures and exits (and can be created later by another
filesystem).
We do not disturb regular inode allocations in any way; they simply do not
care whether the inode table is or is not zeroed. But when zeroing, we
have to skip used inodes, obviously. Also we should prevent new inode
allocations from the group while zeroing is under way. For that we
take write alloc_sem lock in ext4_init_inode_table() and read alloc_sem
in the ext4_claim_inode, so when we are unlucky and allocator hits the
group which is currently being zeroed, it just has to wait.
This can be suppressed using the mount option no_init_itable.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
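The scheduling rule described in the commit message above can be sketched in user space as follows. This is only an illustration under stated assumptions: `li_next_sched` and the tick-based time unit are hypothetical names chosen for this sketch and are not the kernel's actual helpers or fields.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch: compute when the lazy-init thread should next
 * zero an inode table for one filesystem.
 *
 *   now       - current time, in arbitrary ticks
 *   elapsed   - ticks spent zeroing the previous inode table
 *   wait_mult - wait multiplier (the message gives 10 as the default)
 *
 * Waiting elapsed * wait_mult ticks keeps the thread from consuming
 * the whole I/O bandwidth: slower devices are revisited less often.
 */
static uint64_t li_next_sched(uint64_t now, uint64_t elapsed,
                              unsigned int wait_mult)
{
    assert(wait_mult > 0); /* a zero multiplier would mean "no back-off" */
    return now + elapsed * (uint64_t)wait_mult;
}
```

With the default multiplier of 10, a table that took 30 ticks to zero at time 1000 would be followed by the next zeroing pass no earlier than tick 1300.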
2010-10-28 05:30:05 +04:00
err = ext4_register_li_request ( sb , first_not_zeroed ) ;
if ( err )
2011-10-06 20:10:11 +04:00
goto failed_mount6 ;
2010-10-28 05:30:05 +04:00
2015-09-23 19:44:17 +03:00
err = ext4_register_sysfs ( sb ) ;
2011-10-06 20:10:11 +04:00
if ( err )
goto failed_mount7 ;
2009-03-31 17:10:09 +04:00
2013-03-03 03:22:38 +04:00
# ifdef CONFIG_QUOTA
/* Enable quota usage during mount. */
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_quota ( sb ) & & ! ( sb - > s_flags & MS_RDONLY ) ) {
2013-03-03 03:22:38 +04:00
err = ext4_enable_quotas ( sb ) ;
if ( err )
goto failed_mount8 ;
}
# endif /* CONFIG_QUOTA */
2006-10-11 12:20:53 +04:00
EXT4_SB ( sb ) - > s_mount_state | = EXT4_ORPHAN_FS ;
ext4_orphan_cleanup ( sb , es ) ;
EXT4_SB ( sb ) - > s_mount_state & = ~ EXT4_ORPHAN_FS ;
2009-01-07 08:06:22 +03:00
if ( needs_recovery ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_INFO , " recovery complete " ) ;
2009-01-07 08:06:22 +03:00
ext4_mark_recovery_complete ( sb , es ) ;
}
if ( EXT4_SB ( sb ) - > s_journal ) {
if ( test_opt ( sb , DATA_FLAGS ) = = EXT4_MOUNT_JOURNAL_DATA )
descr = " journalled data mode " ;
else if ( test_opt ( sb , DATA_FLAGS ) = = EXT4_MOUNT_ORDERED_DATA )
descr = " ordered data mode " ;
else
descr = " writeback data mode " ;
} else
descr = " out journal " ;
2012-11-08 22:28:29 +04:00
if ( test_opt ( sb , DISCARD ) ) {
struct request_queue * q = bdev_get_queue ( sb - > s_bdev ) ;
if ( ! blk_queue_discard ( q ) )
ext4_msg ( sb , KERN_WARNING ,
" mounting with \" discard \" option, but "
" the device does not support discard " ) ;
}
2015-08-15 21:59:44 +03:00
if ( ___ratelimit ( & ext4_mount_msg_ratelimit , " EXT4-fs mount " ) )
ext4_msg ( sb , KERN_INFO , " mounted filesystem with%s. "
2016-11-18 21:24:26 +03:00
" Opts: %.*s%s%s " , descr ,
( int ) sizeof ( sbi - > s_es - > s_mount_opts ) ,
sbi - > s_es - > s_mount_opts ,
2015-08-15 21:59:44 +03:00
* sbi - > s_es - > s_mount_opts ? " ; " : " " , orig_data ) ;
2006-10-11 12:20:50 +04:00
2010-07-27 19:56:04 +04:00
if ( es - > s_error_count )
mod_timer ( & sbi - > s_err_report , jiffies + 300 * HZ ) ; /* 5 minutes */
2006-10-11 12:20:50 +04:00
2013-10-18 05:11:01 +04:00
/* Enable message ratelimiting. Default is 10 messages per 5 secs. */
ratelimit_state_init ( & sbi - > s_err_ratelimit_state , 5 * HZ , 10 ) ;
ratelimit_state_init ( & sbi - > s_warning_ratelimit_state , 5 * HZ , 10 ) ;
ratelimit_state_init ( & sbi - > s_msg_ratelimit_state , 5 * HZ , 10 ) ;
2010-05-16 20:00:00 +04:00
kfree ( orig_data ) ;
2006-10-11 12:20:50 +04:00
return 0 ;
2006-10-11 12:20:53 +04:00
cantfind_ext4 :
2006-10-11 12:20:50 +04:00
if ( ! silent )
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " VFS: Can't find ext4 filesystem " ) ;
2006-10-11 12:20:50 +04:00
goto failed_mount ;
2013-01-25 08:24:54 +04:00
# ifdef CONFIG_QUOTA
failed_mount8 :
2015-09-23 19:46:17 +03:00
ext4_unregister_sysfs ( sb ) ;
2013-01-25 08:24:54 +04:00
# endif
2011-10-06 20:10:11 +04:00
failed_mount7 :
ext4_unregister_li_request ( sb ) ;
failed_mount6 :
2014-07-11 21:55:40 +04:00
ext4_mb_release ( sb ) ;
2014-07-15 14:01:38 +04:00
if ( sbi - > s_flex_groups )
2014-11-20 20:19:11 +03:00
kvfree ( sbi - > s_flex_groups ) ;
2014-07-15 14:01:38 +04:00
percpu_counter_destroy ( & sbi - > s_freeclusters_counter ) ;
percpu_counter_destroy ( & sbi - > s_freeinodes_counter ) ;
percpu_counter_destroy ( & sbi - > s_dirs_counter ) ;
percpu_counter_destroy ( & sbi - > s_dirtyclusters_counter ) ;
ext4: initialize multi-block allocator before checking block descriptors
With EXT4FS_DEBUG ext4_count_free_clusters() will call
ext4_read_block_bitmap() without s_group_info initialized, so we need to
initialize multi-block allocator before.
The dependencies that must be resolved to allow this:
- multi-block allocator needs the group descriptors
- need to install s_op before initializing multi-block allocator,
because in ext4_mb_init_backend() new inode is created.
- initialize number of group desc blocks (s_gdb_count) otherwise
number of clusters returned by ext4_free_clusters_after_init() is not correct.
(see ext4_bg_num_gdb_nometa())
Here is the stack backtrace:
(gdb) bt
#0 ext4_get_group_info (group=0, sb=0xffff880079a10000) at ext4.h:2430
#1 ext4_validate_block_bitmap (sb=sb@entry=0xffff880079a10000,
desc=desc@entry=0xffff880056510000, block_group=block_group@entry=0,
bh=bh@entry=0xffff88007bf2b2d8) at balloc.c:358
#2 0xffffffff81232202 in ext4_wait_block_bitmap (sb=sb@entry=0xffff880079a10000,
block_group=block_group@entry=0,
bh=bh@entry=0xffff88007bf2b2d8) at balloc.c:476
#3 0xffffffff81232eaf in ext4_read_block_bitmap (sb=sb@entry=0xffff880079a10000,
block_group=block_group@entry=0) at balloc.c:489
#4 0xffffffff81232fc0 in ext4_count_free_clusters (sb=sb@entry=0xffff880079a10000) at balloc.c:665
#5 0xffffffff81259ffa in ext4_check_descriptors (first_not_zeroed=<synthetic pointer>,
sb=0xffff880079a10000) at super.c:2143
#6 ext4_fill_super (sb=sb@entry=0xffff880079a10000, data=<optimized out>,
data@entry=0x0 <irq_stack_union>, silent=silent@entry=0) at super.c:3851
...
Signed-off-by: Azat Khuzhin <a3at.mail@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-07 18:54:20 +04:00
failed_mount5 :
2014-07-11 21:55:40 +04:00
ext4_ext_release ( sb ) ;
ext4_release_system_zone ( sb ) ;
failed_mount4a :
2012-01-10 00:53:24 +04:00
dput ( sb - > s_root ) ;
2011-02-28 04:42:06 +03:00
sb - > s_root = NULL ;
2012-01-10 00:53:24 +04:00
failed_mount4 :
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " mount failed " ) ;
2013-06-04 22:21:02 +04:00
if ( EXT4_SB ( sb ) - > rsv_conversion_wq )
destroy_workqueue ( EXT4_SB ( sb ) - > rsv_conversion_wq ) ;
2009-09-28 23:48:41 +04:00
failed_mount_wq :
2017-06-22 18:44:55 +03:00
if ( sbi - > s_ea_inode_cache ) {
ext4_xattr_destroy_cache ( sbi - > s_ea_inode_cache ) ;
sbi - > s_ea_inode_cache = NULL ;
}
2017-06-22 18:28:55 +03:00
if ( sbi - > s_ea_block_cache ) {
ext4_xattr_destroy_cache ( sbi - > s_ea_block_cache ) ;
sbi - > s_ea_block_cache = NULL ;
2016-02-22 19:50:13 +03:00
}
2009-01-07 08:06:22 +03:00
if ( sbi - > s_journal ) {
jbd2_journal_destroy ( sbi - > s_journal ) ;
sbi - > s_journal = NULL ;
}
2014-10-30 17:53:16 +03:00
failed_mount3a :
2013-07-01 16:12:37 +04:00
ext4_es_unregister_shrinker ( sbi ) ;
2014-09-02 06:26:49 +04:00
failed_mount3 :
2013-12-09 05:52:31 +04:00
del_timer_sync ( & sbi - > s_err_report ) ;
2011-05-25 02:31:25 +04:00
if ( sbi - > s_mmp_tsk )
kthread_stop ( sbi - > s_mmp_tsk ) ;
2006-10-11 12:20:50 +04:00
failed_mount2 :
for ( i = 0 ; i < db_count ; i + + )
brelse ( sbi - > s_group_desc [ i ] ) ;
2014-11-20 20:19:11 +03:00
kvfree ( sbi - > s_group_desc ) ;
2006-10-11 12:20:50 +04:00
failed_mount :
2012-04-30 02:27:10 +04:00
if ( sbi - > s_chksum_driver )
crypto_free_shash ( sbi - > s_chksum_driver ) ;
2006-10-11 12:20:50 +04:00
# ifdef CONFIG_QUOTA
2014-09-11 19:15:15 +04:00
for ( i = 0 ; i < EXT4_MAXQUOTAS ; i + + )
2006-10-11 12:20:50 +04:00
kfree ( sbi - > s_qf_names [ i ] ) ;
# endif
2006-10-11 12:20:53 +04:00
ext4_blkdev_remove ( sbi ) ;
2006-10-11 12:20:50 +04:00
brelse ( bh ) ;
out_fail :
sb - > s_fs_info = NULL ;
2009-05-18 07:52:44 +04:00
kfree ( sbi - > s_blockgroup_lock ) ;
2016-11-18 21:24:26 +03:00
out_free_base :
2006-10-11 12:20:50 +04:00
kfree ( sbi ) ;
2010-05-16 20:00:00 +04:00
kfree ( orig_data ) ;
2012-11-09 00:16:54 +04:00
return err ? err : ret ;
2006-10-11 12:20:50 +04:00
}
/*
* Setup any per - fs journal parameters now . We ' ll do this both on
* initial mount , once the journal has been initialised but before we ' ve
* done any recovery ; and again on any subsequent remount .
*/
2006-10-11 12:20:53 +04:00
static void ext4_init_journal_params ( struct super_block * sb , journal_t * journal )
2006-10-11 12:20:50 +04:00
{
2006-10-11 12:20:53 +04:00
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
2006-10-11 12:20:50 +04:00
2009-01-04 04:27:38 +03:00
journal - > j_commit_interval = sbi - > s_commit_interval ;
journal - > j_min_batch_time = sbi - > s_min_batch_time ;
journal - > j_max_batch_time = sbi - > s_max_batch_time ;
2006-10-11 12:20:50 +04:00
2010-08-04 05:35:12 +04:00
write_lock ( & journal - > j_state_lock ) ;
2006-10-11 12:20:50 +04:00
if ( test_opt ( sb , BARRIER ) )
2006-10-11 12:21:01 +04:00
journal - > j_flags | = JBD2_BARRIER ;
2006-10-11 12:20:50 +04:00
else
2006-10-11 12:21:01 +04:00
journal - > j_flags & = ~ JBD2_BARRIER ;
2008-10-11 06:12:43 +04:00
if ( test_opt ( sb , DATA_ERR_ABORT ) )
journal - > j_flags | = JBD2_ABORT_ON_SYNCDATA_ERR ;
else
journal - > j_flags & = ~ JBD2_ABORT_ON_SYNCDATA_ERR ;
2010-08-04 05:35:12 +04:00
write_unlock ( & journal - > j_state_lock ) ;
2006-10-11 12:20:50 +04:00
}
2016-09-30 09:05:09 +03:00
static struct inode * ext4_get_journal_inode ( struct super_block * sb ,
unsigned int journal_inum )
2006-10-11 12:20:50 +04:00
{
struct inode * journal_inode ;
2016-09-30 09:05:09 +03:00
/*
* Test for the existence of a valid inode on disk . Bad things
* happen if we iget ( ) an unused inode , as the subsequent iput ( )
* will try to delete it .
*/
2008-02-07 11:15:37 +03:00
journal_inode = ext4_iget ( sb , journal_inum ) ;
if ( IS_ERR ( journal_inode ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " no journal found " ) ;
2006-10-11 12:20:50 +04:00
return NULL ;
}
if ( ! journal_inode - > i_nlink ) {
make_bad_inode ( journal_inode ) ;
iput ( journal_inode ) ;
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " journal inode is deleted " ) ;
2006-10-11 12:20:50 +04:00
return NULL ;
}
2008-09-09 06:25:04 +04:00
jbd_debug ( 2 , " Journal inode found at %p: %lld bytes \n " ,
2006-10-11 12:20:50 +04:00
journal_inode , journal_inode - > i_size ) ;
2008-02-07 11:15:37 +03:00
if ( ! S_ISREG ( journal_inode - > i_mode ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " invalid journal inode " ) ;
2006-10-11 12:20:50 +04:00
iput ( journal_inode ) ;
return NULL ;
}
2016-09-30 09:05:09 +03:00
return journal_inode ;
}
static journal_t * ext4_get_journal ( struct super_block * sb ,
unsigned int journal_inum )
{
struct inode * journal_inode ;
journal_t * journal ;
BUG_ON ( ! ext4_has_feature_journal ( sb ) ) ;
journal_inode = ext4_get_journal_inode ( sb , journal_inum ) ;
if ( ! journal_inode )
return NULL ;
2006-10-11 12:20:50 +04:00
2006-10-11 12:21:01 +04:00
journal = jbd2_journal_init_inode ( journal_inode ) ;
2006-10-11 12:20:50 +04:00
if ( ! journal ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " Could not load journal inode " ) ;
2006-10-11 12:20:50 +04:00
iput ( journal_inode ) ;
return NULL ;
}
journal - > j_private = sb ;
2006-10-11 12:20:53 +04:00
ext4_init_journal_params ( sb , journal ) ;
2006-10-11 12:20:50 +04:00
return journal ;
}
2006-10-11 12:20:53 +04:00
static journal_t * ext4_get_dev_journal ( struct super_block * sb ,
2006-10-11 12:20:50 +04:00
dev_t j_dev )
{
2008-07-27 00:15:44 +04:00
struct buffer_head * bh ;
2006-10-11 12:20:50 +04:00
journal_t * journal ;
2006-10-11 12:20:53 +04:00
ext4_fsblk_t start ;
ext4_fsblk_t len ;
2006-10-11 12:20:50 +04:00
int hblock , blocksize ;
2006-10-11 12:20:53 +04:00
ext4_fsblk_t sb_block ;
2006-10-11 12:20:50 +04:00
unsigned long offset ;
2008-07-27 00:15:44 +04:00
struct ext4_super_block * es ;
2006-10-11 12:20:50 +04:00
struct block_device * bdev ;
2015-10-17 23:18:43 +03:00
BUG_ON ( ! ext4_has_feature_journal ( sb ) ) ;
2009-01-07 08:06:22 +03:00
2009-06-05 01:36:36 +04:00
bdev = ext4_blkdev_get ( j_dev , sb ) ;
2006-10-11 12:20:50 +04:00
if ( bdev = = NULL )
return NULL ;
blocksize = sb - > s_blocksize ;
2009-05-23 01:17:49 +04:00
hblock = bdev_logical_block_size ( bdev ) ;
2006-10-11 12:20:50 +04:00
if ( blocksize < hblock ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR ,
" blocksize too small for journal device " ) ;
2006-10-11 12:20:50 +04:00
goto out_bdev ;
}
2006-10-11 12:20:53 +04:00
sb_block = EXT4_MIN_BLOCK_SIZE / blocksize ;
offset = EXT4_MIN_BLOCK_SIZE % blocksize ;
2006-10-11 12:20:50 +04:00
set_blocksize ( bdev , blocksize ) ;
if ( ! ( bh = __bread ( bdev , sb_block , blocksize ) ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " couldn't read superblock of "
" external journal " ) ;
2006-10-11 12:20:50 +04:00
goto out_bdev ;
}
2012-05-29 01:47:52 +04:00
es = ( struct ext4_super_block * ) ( bh - > b_data + offset ) ;
2006-10-11 12:20:53 +04:00
if ( ( le16_to_cpu ( es - > s_magic ) ! = EXT4_SUPER_MAGIC ) | |
2006-10-11 12:20:50 +04:00
! ( le32_to_cpu ( es - > s_feature_incompat ) &
2006-10-11 12:20:53 +04:00
EXT4_FEATURE_INCOMPAT_JOURNAL_DEV ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " external journal has "
" bad superblock " ) ;
2006-10-11 12:20:50 +04:00
brelse ( bh ) ;
goto out_bdev ;
}
2014-09-11 19:44:36 +04:00
if ( ( le32_to_cpu ( es - > s_feature_ro_compat ) &
EXT4_FEATURE_RO_COMPAT_METADATA_CSUM ) & &
es - > s_checksum ! = ext4_superblock_csum ( sb , es ) ) {
ext4_msg ( sb , KERN_ERR , " external journal has "
" corrupt superblock " ) ;
brelse ( bh ) ;
goto out_bdev ;
}
2006-10-11 12:20:53 +04:00
if ( memcmp ( EXT4_SB ( sb ) - > s_es - > s_journal_uuid , es - > s_uuid , 16 ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " journal UUID does not match " ) ;
2006-10-11 12:20:50 +04:00
brelse ( bh ) ;
goto out_bdev ;
}
2006-10-11 12:21:10 +04:00
len = ext4_blocks_count ( es ) ;
2006-10-11 12:20:50 +04:00
start = sb_block + 1 ;
brelse ( bh ) ; /* we're done with the superblock */
2006-10-11 12:21:01 +04:00
journal = jbd2_journal_init_dev ( bdev , sb - > s_bdev ,
2006-10-11 12:20:50 +04:00
start , len , blocksize ) ;
if ( ! journal ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " failed to create device journal " ) ;
2006-10-11 12:20:50 +04:00
goto out_bdev ;
}
journal - > j_private = sb ;
2016-06-05 22:31:44 +03:00
ll_rw_block ( REQ_OP_READ , REQ_META | REQ_PRIO , 1 , & journal - > j_sb_buffer ) ;
2006-10-11 12:20:50 +04:00
wait_on_buffer ( journal - > j_sb_buffer ) ;
if ( ! buffer_uptodate ( journal - > j_sb_buffer ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " I/O error on journal device " ) ;
2006-10-11 12:20:50 +04:00
goto out_journal ;
}
if ( be32_to_cpu ( journal - > j_superblock - > s_nr_users ) ! = 1 ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " External journal has more than one "
" user (unsupported) - %d " ,
2006-10-11 12:20:50 +04:00
be32_to_cpu ( journal - > j_superblock - > s_nr_users ) ) ;
goto out_journal ;
}
2006-10-11 12:20:53 +04:00
EXT4_SB ( sb ) - > journal_bdev = bdev ;
ext4_init_journal_params ( sb , journal ) ;
2006-10-11 12:20:50 +04:00
return journal ;
2009-06-04 01:59:28 +04:00
2006-10-11 12:20:50 +04:00
out_journal :
2006-10-11 12:21:01 +04:00
jbd2_journal_destroy ( journal ) ;
2006-10-11 12:20:50 +04:00
out_bdev :
2006-10-11 12:20:53 +04:00
ext4_blkdev_put ( bdev ) ;
2006-10-11 12:20:50 +04:00
return NULL ;
}
2006-10-11 12:20:53 +04:00
static int ext4_load_journal ( struct super_block * sb ,
struct ext4_super_block * es ,
2006-10-11 12:20:50 +04:00
unsigned long journal_devnum )
{
journal_t * journal ;
unsigned int journal_inum = le32_to_cpu ( es - > s_journal_inum ) ;
dev_t journal_dev ;
int err = 0 ;
int really_read_only ;
2015-10-17 23:18:43 +03:00
BUG_ON ( ! ext4_has_feature_journal ( sb ) ) ;
2009-01-07 08:06:22 +03:00
2006-10-11 12:20:50 +04:00
if ( journal_devnum & &
journal_devnum ! = le32_to_cpu ( es - > s_journal_dev ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_INFO , " external journal device major/minor "
" numbers have changed " ) ;
2006-10-11 12:20:50 +04:00
journal_dev = new_decode_dev ( journal_devnum ) ;
} else
journal_dev = new_decode_dev ( le32_to_cpu ( es - > s_journal_dev ) ) ;
really_read_only = bdev_read_only ( sb - > s_bdev ) ;
/*
* Are we loading a blank journal or performing recovery after a
* crash ? For recovery , we need to check in advance whether we
* can get read - write access to the device .
*/
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_journal_needs_recovery ( sb ) ) {
2006-10-11 12:20:50 +04:00
if ( sb - > s_flags & MS_RDONLY ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_INFO , " INFO: recovery "
" required on readonly filesystem " ) ;
2006-10-11 12:20:50 +04:00
if ( really_read_only ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " write access "
" unavailable, cannot proceed " ) ;
2006-10-11 12:20:50 +04:00
return - EROFS ;
}
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_INFO , " write access will "
" be enabled during recovery " ) ;
2006-10-11 12:20:50 +04:00
}
}
if ( journal_inum & & journal_dev ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " filesystem has both journal "
" and inode journals! " ) ;
2006-10-11 12:20:50 +04:00
return - EINVAL ;
}
if ( journal_inum ) {
2006-10-11 12:20:53 +04:00
if ( ! ( journal = ext4_get_journal ( sb , journal_inum ) ) )
2006-10-11 12:20:50 +04:00
return - EINVAL ;
} else {
2006-10-11 12:20:53 +04:00
if ( ! ( journal = ext4_get_dev_journal ( sb , journal_dev ) ) )
2006-10-11 12:20:50 +04:00
return - EINVAL ;
}
2009-09-29 23:51:30 +04:00
if ( ! ( journal - > j_flags & JBD2_BARRIER ) )
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_INFO , " barriers disabled " ) ;
2008-09-09 07:00:52 +04:00
2015-10-17 23:18:43 +03:00
if ( ! ext4_has_feature_journal_needs_recovery ( sb ) )
2006-10-11 12:21:01 +04:00
err = jbd2_journal_wipe ( journal , ! really_read_only ) ;
2010-07-27 19:56:03 +04:00
if ( ! err ) {
char * save = kmalloc ( EXT4_S_ERR_LEN , GFP_KERNEL ) ;
if ( save )
memcpy ( save , ( ( char * ) es ) +
EXT4_S_ERR_START , EXT4_S_ERR_LEN ) ;
2006-10-11 12:21:01 +04:00
err = jbd2_journal_load ( journal ) ;
2010-07-27 19:56:03 +04:00
if ( save )
memcpy ( ( ( char * ) es ) + EXT4_S_ERR_START ,
save , EXT4_S_ERR_LEN ) ;
kfree ( save ) ;
}
2006-10-11 12:20:50 +04:00
if ( err ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " error loading journal " ) ;
2006-10-11 12:21:01 +04:00
jbd2_journal_destroy ( journal ) ;
2006-10-11 12:20:50 +04:00
return err ;
}
2006-10-11 12:20:53 +04:00
EXT4_SB ( sb ) - > s_journal = journal ;
ext4_clear_journal_err ( sb , es ) ;
2006-10-11 12:20:50 +04:00
2010-10-28 05:30:06 +04:00
if ( ! really_read_only & & journal_devnum & &
2006-10-11 12:20:50 +04:00
journal_devnum ! = le32_to_cpu ( es - > s_journal_dev ) ) {
es - > s_journal_dev = cpu_to_le32 ( journal_devnum ) ;
/* Make sure we flush the recovery flag to disk. */
2009-05-01 08:33:44 +04:00
ext4_commit_super ( sb , 1 ) ;
2006-10-11 12:20:50 +04:00
}
return 0 ;
}
2009-05-01 08:33:44 +04:00
static int ext4_commit_super ( struct super_block * sb , int sync )
2006-10-11 12:20:50 +04:00
{
2009-05-01 08:33:44 +04:00
struct ext4_super_block * es = EXT4_SB ( sb ) - > s_es ;
2006-10-11 12:20:53 +04:00
struct buffer_head * sbh = EXT4_SB ( sb ) - > s_sbh ;
2009-01-10 03:40:58 +03:00
int error = 0 ;
2006-10-11 12:20:50 +04:00
2015-08-16 17:03:57 +03:00
if ( ! sbh | | block_device_ejected ( sb ) )
2009-01-10 03:40:58 +03:00
return error ;
2009-09-11 01:31:04 +04:00
/*
* If the file system is mounted read - only , don ' t update the
* superblock write time . This avoids updating the superblock
* write time when we are mounting the root file system
* read / only but we need to replay the journal ; at that point ,
* for people who are east of GMT and who make their clock
* tick in localtime for Windows bug - for - bug compatibility ,
* the clock is set in the future , and this will cause e2fsck
* to complain and force a full file system check .
*/
if ( ! ( sb - > s_flags & MS_RDONLY ) )
es - > s_wtime = cpu_to_le32 ( get_seconds ( ) ) ;
2010-07-27 19:56:08 +04:00
if ( sb - > s_bdev - > bd_part )
es - > s_kbytes_written =
cpu_to_le64 ( EXT4_SB ( sb ) - > s_kbytes_written +
2009-03-01 03:39:58 +03:00
( ( part_stat_read ( sb - > s_bdev - > bd_part , sectors [ 1 ] ) -
EXT4_SB ( sb ) - > s_sectors_written_start ) > > 1 ) ) ;
2010-07-27 19:56:08 +04:00
else
es - > s_kbytes_written =
cpu_to_le64 ( EXT4_SB ( sb ) - > s_kbytes_written ) ;
2014-07-15 14:01:38 +04:00
if ( percpu_counter_initialized ( & EXT4_SB ( sb ) - > s_freeclusters_counter ) )
ext4_free_blocks_count_set ( es ,
2011-09-10 02:56:51 +04:00
EXT4_C2B ( EXT4_SB ( sb ) , percpu_counter_sum_positive (
& EXT4_SB ( sb ) - > s_freeclusters_counter ) ) ) ;
2014-07-15 14:01:38 +04:00
if ( percpu_counter_initialized ( & EXT4_SB ( sb ) - > s_freeinodes_counter ) )
es - > s_free_inodes_count =
cpu_to_le32 ( percpu_counter_sum_positive (
2010-11-03 19:03:21 +03:00
& EXT4_SB ( sb ) - > s_freeinodes_counter ) ) ;
2006-10-11 12:20:50 +04:00
BUFFER_TRACE ( sbh , " marking dirty " ) ;
2012-10-10 09:06:58 +04:00
ext4_superblock_csum_set ( sb ) ;
ext4: don't lock buffer in ext4_commit_super if holding spinlock
If there is an error reported in mballoc via ext4_grp_locked_error(),
the code is holding a spinlock, so ext4_commit_super() must not try to
lock the buffer head, or else it will trigger a BUG:
BUG: sleeping function called from invalid context at ./include/linux/buffer_head.h:358
in_atomic(): 1, irqs_disabled(): 0, pid: 993, name: mount
CPU: 0 PID: 993 Comm: mount Not tainted 4.9.0-rc1-clouder1 #62
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
ffff880006423548 ffffffff81318c89 ffffffff819ecdd0 0000000000000166
ffff880006423558 ffffffff810810b0 ffff880006423580 ffffffff81081153
ffff880006e5a1a0 ffff88000690e400 0000000000000000 ffff8800064235c0
Call Trace:
[<ffffffff81318c89>] dump_stack+0x67/0x9e
[<ffffffff810810b0>] ___might_sleep+0xf0/0x140
[<ffffffff81081153>] __might_sleep+0x53/0xb0
[<ffffffff8126c1dc>] ext4_commit_super+0x19c/0x290
[<ffffffff8126e61a>] __ext4_grp_locked_error+0x14a/0x230
[<ffffffff81081153>] ? __might_sleep+0x53/0xb0
[<ffffffff812822be>] ext4_mb_generate_buddy+0x1de/0x320
Since ext4_grp_locked_error() calls ext4_commit_super with sync == 0
(and it is the only caller which does so), avoid locking and unlocking
the buffer in this case.
This can result in races with ext4_commit_super() if there are other
problems (which is what commit 4743f83990614 was trying to address),
but a Warning is better than BUG.
Fixes: 4743f83990614
Cc: stable@vger.kernel.org # 4.9
Reported-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
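The rule from the commit message above (take the buffer lock only on the synchronous path, because the asynchronous caller may hold a spinlock and must not sleep) can be sketched in user space as follows. The struct and function names are hypothetical stand-ins for this illustration, not kernel types, and the "lock" is a plain flag rather than a real sleeping lock.

```c
#include <assert.h>

/* Hypothetical stand-in for the superblock buffer head. */
struct sb_buffer_sketch {
    int locked;     /* 1 while the (possibly sleeping) lock is held */
    int dirty;      /* 1 once the buffer has been marked dirty */
    int lock_calls; /* how many times the lock was taken */
};

/*
 * Sketch of the commit path: callers passing sync == 0 may be in
 * atomic context, so the lock/unlock pair is skipped for them.
 */
static void commit_super_sketch(struct sb_buffer_sketch *b, int sync)
{
    assert(b != 0);
    if (sync) {
        b->locked = 1;
        b->lock_calls++;
    }
    b->dirty = 1; /* always mark dirty, locked or not */
    if (sync)
        b->locked = 0;
}
```

As the commit message notes, skipping the lock trades a possible race for the guarantee that the atomic-context caller never sleeps.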
2016-11-14 06:02:29 +03:00
if ( sync )
lock_buffer ( sbh ) ;
2016-07-04 17:24:52 +03:00
if ( buffer_write_io_error ( sbh ) ) {
/*
* Oh , dear . A previous attempt to write the
* superblock failed . This could happen because the
* USB device was yanked out . Or it could happen to
* be a transient write error and maybe the block will
* be remapped . Nothing we can do but to retry the
* write and hope for the best .
*/
ext4_msg ( sb , KERN_ERR , " previous I/O error to "
" superblock detected " ) ;
clear_buffer_write_io_error ( sbh ) ;
set_buffer_uptodate ( sbh ) ;
}
2006-10-11 12:20:50 +04:00
mark_buffer_dirty ( sbh ) ;
2008-10-07 05:35:40 +04:00
if ( sync ) {
2016-11-14 06:02:29 +03:00
unlock_buffer ( sbh ) ;
ext4, jbd2: add REQ_FUA flag when recording an error in the superblock
When an error condition is detected, an error status should be recorded in
the superblocks of EXT4 or JBD2. However, the write request is currently
submitted without the REQ_FUA flag, even in "barrier=1" mode, and in
"errors=panic" mode it is followed by a call to panic(). On mobile devices
that reset the whole system as soon as a kernel panic occurs, this write
request containing the error flag can be lost from the storage cache without
ever being written to the physical cells. As a result, the error flag never
appears in either superblock after the next boot, and e2fsck cannot fix
the filesystem problems automatically unless it is run in forced
checking mode.
[ Changed to use test_opt(sb, BARRIER) instead of checking the journal flags -- TYT ]
Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-23 16:46:11 +03:00
error = __sync_dirty_buffer ( sbh ,
2017-05-04 17:58:03 +03:00
REQ_SYNC | ( test_opt ( sb , BARRIER ) ? REQ_FUA : 0 ) ) ;
2009-01-10 03:40:58 +03:00
if ( error )
return error ;
error = buffer_write_io_error ( sbh ) ;
if ( error ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR , " I/O error while writing "
" superblock " ) ;
2008-10-07 05:35:40 +04:00
clear_buffer_write_io_error ( sbh ) ;
set_buffer_uptodate ( sbh ) ;
}
}
2009-01-10 03:40:58 +03:00
return error ;
2006-10-11 12:20:50 +04:00
}
/*
* Have we just finished recovery ? If so , and if we are mounting ( or
* remounting ) the filesystem readonly , then we will end up with a
* consistent fs on disk . Record that fact .
*/
2008-07-27 00:15:44 +04:00
static void ext4_mark_recovery_complete ( struct super_block * sb ,
struct ext4_super_block * es )
2006-10-11 12:20:50 +04:00
{
2006-10-11 12:20:53 +04:00
journal_t * journal = EXT4_SB ( sb ) - > s_journal ;
2006-10-11 12:20:50 +04:00
2015-10-17 23:18:43 +03:00
if ( ! ext4_has_feature_journal ( sb ) ) {
2009-01-07 08:06:22 +03:00
BUG_ON ( journal ! = NULL ) ;
return ;
}
2006-10-11 12:21:01 +04:00
jbd2_journal_lock_updates ( journal ) ;
2008-10-11 04:29:21 +04:00
if ( jbd2_journal_flush ( journal ) < 0 )
goto out ;
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_journal_needs_recovery ( sb ) & &
2006-10-11 12:20:50 +04:00
sb - > s_flags & MS_RDONLY ) {
2015-10-17 23:18:43 +03:00
ext4_clear_feature_journal_needs_recovery ( sb ) ;
2009-05-01 08:33:44 +04:00
ext4_commit_super ( sb , 1 ) ;
2006-10-11 12:20:50 +04:00
}
2008-10-11 04:29:21 +04:00
out :
2006-10-11 12:21:01 +04:00
jbd2_journal_unlock_updates ( journal ) ;
2006-10-11 12:20:50 +04:00
}
/*
* If we are mounting ( or read - write remounting ) a filesystem whose journal
* has recorded an error from a previous lifetime , move that error to the
* main filesystem now .
*/
2008-07-27 00:15:44 +04:00
static void ext4_clear_journal_err ( struct super_block * sb ,
struct ext4_super_block * es )
2006-10-11 12:20:50 +04:00
{
journal_t * journal ;
int j_errno ;
const char * errstr ;
2015-10-17 23:18:43 +03:00
BUG_ON ( ! ext4_has_feature_journal ( sb ) ) ;
2009-01-07 08:06:22 +03:00
2006-10-11 12:20:53 +04:00
journal = EXT4_SB ( sb ) - > s_journal ;
2006-10-11 12:20:50 +04:00
/*
* Now check for any error status which may have been recorded in the
2006-10-11 12:20:53 +04:00
* journal by a prior ext4_error ( ) or ext4_abort ( )
2006-10-11 12:20:50 +04:00
*/
2006-10-11 12:21:01 +04:00
j_errno = jbd2_journal_errno ( journal ) ;
2006-10-11 12:20:50 +04:00
if ( j_errno ) {
char nbuf [ 16 ] ;
2006-10-11 12:20:53 +04:00
errstr = ext4_decode_error ( sb , j_errno , nbuf ) ;
2010-02-15 22:19:27 +03:00
ext4_warning ( sb , " Filesystem error recorded "
2006-10-11 12:20:50 +04:00
" from previous mount: %s " , errstr ) ;
2010-02-15 22:19:27 +03:00
ext4_warning ( sb , " Marking fs in need of filesystem check. " ) ;
2006-10-11 12:20:50 +04:00
2006-10-11 12:20:53 +04:00
EXT4_SB ( sb ) - > s_mount_state | = EXT4_ERROR_FS ;
es - > s_state | = cpu_to_le16 ( EXT4_ERROR_FS ) ;
2009-05-01 08:33:44 +04:00
ext4_commit_super ( sb , 1 ) ;
2006-10-11 12:20:50 +04:00
2006-10-11 12:21:01 +04:00
jbd2_journal_clear_err ( journal ) ;
2012-08-06 03:04:57 +04:00
jbd2_journal_update_sb_errno ( journal ) ;
2006-10-11 12:20:50 +04:00
}
}
/*
* Force the running and committing transactions to commit ,
* and wait on the commit .
*/
2006-10-11 12:20:53 +04:00
int ext4_force_commit ( struct super_block * sb )
2006-10-11 12:20:50 +04:00
{
journal_t * journal ;
if ( sb - > s_flags & MS_RDONLY )
return 0 ;
2006-10-11 12:20:53 +04:00
journal = EXT4_SB ( sb ) - > s_journal ;
2013-01-29 06:41:02 +04:00
return ext4_journal_force_commit ( journal ) ;
2006-10-11 12:20:50 +04:00
}
2006-10-11 12:20:53 +04:00
static int ext4_sync_fs ( struct super_block * sb , int wait )
2006-10-11 12:20:50 +04:00
{
2008-11-04 02:10:55 +03:00
int ret = 0 ;
2009-02-10 14:46:05 +03:00
tid_t target ;
2013-06-13 06:25:07 +04:00
bool needs_barrier = false ;
2009-09-28 23:48:29 +04:00
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
2006-10-11 12:20:50 +04:00
2017-02-05 09:28:48 +03:00
if ( unlikely ( ext4_forced_shutdown ( EXT4_SB ( sb ) ) ) )
return 0 ;
2009-06-17 19:48:11 +04:00
trace_ext4_sync_fs ( sb , wait ) ;
2013-06-04 22:21:02 +04:00
flush_workqueue ( sbi - > rsv_conversion_wq ) ;
2012-07-03 18:45:29 +04:00
/*
* Writeback quota in non - journalled quota case - journalled quota has
* no dirty dquots
*/
dquot_writeback_dquots ( sb , - 1 ) ;
2013-06-13 06:25:07 +04:00
/*
* Data writeback is possible w / o journal transaction , so a barrier must
* be sent at the end of the function . But we can skip it if
* transaction_commit will do it for us .
*/
2014-09-19 00:12:37 +04:00
if ( sbi - > s_journal ) {
target = jbd2_get_latest_transaction ( sbi - > s_journal ) ;
if ( wait & & sbi - > s_journal - > j_flags & JBD2_BARRIER & &
! jbd2_trans_will_send_data_barrier ( sbi - > s_journal , target ) )
needs_barrier = true ;
if ( jbd2_journal_start_commit ( sbi - > s_journal , & target ) ) {
if ( wait )
ret = jbd2_log_wait_commit ( sbi - > s_journal ,
target ) ;
}
} else if ( wait & & test_opt ( sb , BARRIER ) )
2013-06-13 06:25:07 +04:00
needs_barrier = true ;
if ( needs_barrier ) {
int err ;
err = blkdev_issue_flush ( sb - > s_bdev , GFP_KERNEL , NULL ) ;
if ( ! ret )
ret = err ;
2009-01-07 08:06:22 +03:00
}
2013-06-13 06:25:07 +04:00
return ret ;
}
2006-10-11 12:20:50 +04:00
/*
* LVM calls this function before a ( read - only ) snapshot is created . This
* gives us a chance to flush the journal completely and mark the fs clean .
2011-04-11 06:06:07 +04:00
*
* Note that this function cannot bring a filesystem into a clean
2012-06-12 18:20:38 +04:00
* state on its own . It relies on the upper layer to stop all data & metadata
* modifications .
2006-10-11 12:20:50 +04:00
*/
2009-01-10 03:40:58 +03:00
static int ext4_freeze ( struct super_block * sb )
2006-10-11 12:20:50 +04:00
{
2009-01-10 03:40:58 +03:00
int error = 0 ;
journal_t * journal ;
2006-10-11 12:20:50 +04:00
2009-05-01 20:52:25 +04:00
if ( sb - > s_flags & MS_RDONLY )
return 0 ;
2006-10-11 12:20:50 +04:00
2009-05-01 20:52:25 +04:00
journal = EXT4_SB ( sb ) - > s_journal ;
2008-10-11 04:29:21 +04:00
2014-09-19 01:12:02 +04:00
if ( journal ) {
/* Now we set up the journal barrier. */
jbd2_journal_lock_updates ( journal ) ;
2006-10-11 12:20:50 +04:00
2014-09-19 01:12:02 +04:00
/*
* Don ' t clear the needs_recovery flag if we failed to
* flush the journal .
*/
error = jbd2_journal_flush ( journal ) ;
if ( error < 0 )
goto out ;
2015-08-15 17:45:06 +03:00
/* Journal blocked and flushed, clear needs_recovery flag. */
2015-10-17 23:18:43 +03:00
ext4_clear_feature_journal_needs_recovery ( sb ) ;
2014-09-19 01:12:02 +04:00
}
2009-05-01 20:52:25 +04:00
error = ext4_commit_super ( sb , 1 ) ;
2010-05-16 10:00:00 +04:00
out :
2014-09-19 01:12:02 +04:00
if ( journal )
/* we rely on upper layer to stop further updates */
jbd2_journal_unlock_updates ( journal ) ;
2010-05-16 10:00:00 +04:00
return error ;
2006-10-11 12:20:50 +04:00
}
/*
* Called by LVM after the snapshot is done . We need to reset the RECOVER
* flag here , even though the filesystem is not technically dirty yet .
*/
2009-01-10 03:40:58 +03:00
static int ext4_unfreeze ( struct super_block * sb )
2006-10-11 12:20:50 +04:00
{
2017-02-06 03:47:14 +03:00
if ( ( sb - > s_flags & MS_RDONLY ) | | ext4_forced_shutdown ( EXT4_SB ( sb ) ) )
2009-05-01 20:52:25 +04:00
return 0 ;
2015-08-15 17:45:06 +03:00
if ( EXT4_SB ( sb ) - > s_journal ) {
/* Reset the needs_recovery flag before the fs is unlocked. */
2015-10-17 23:18:43 +03:00
ext4_set_feature_journal_needs_recovery ( sb ) ;
2015-08-15 17:45:06 +03:00
}
2009-05-01 20:52:25 +04:00
ext4_commit_super ( sb , 1 ) ;
2009-01-10 03:40:58 +03:00
return 0 ;
2006-10-11 12:20:50 +04:00
}
2010-12-16 04:28:48 +03:00
/*
* Structure to save mount options for ext4_remount ' s benefit
*/
struct ext4_mount_options {
unsigned long s_mount_opt ;
2010-12-16 04:30:48 +03:00
unsigned long s_mount_opt2 ;
2012-02-08 03:41:49 +04:00
kuid_t s_resuid ;
kgid_t s_resgid ;
2010-12-16 04:28:48 +03:00
unsigned long s_commit_interval ;
u32 s_min_batch_time , s_max_batch_time ;
# ifdef CONFIG_QUOTA
int s_jquota_fmt ;
2014-09-11 19:15:15 +04:00
char * s_qf_names [ EXT4_MAXQUOTAS ] ;
2010-12-16 04:28:48 +03:00
# endif
} ;
2008-07-27 00:15:44 +04:00
static int ext4_remount ( struct super_block * sb , int * flags , char * data )
2006-10-11 12:20:50 +04:00
{
2008-07-27 00:15:44 +04:00
struct ext4_super_block * es ;
2006-10-11 12:20:53 +04:00
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
2006-10-11 12:20:50 +04:00
unsigned long old_sb_flags ;
2006-10-11 12:20:53 +04:00
struct ext4_mount_options old_opts ;
2010-05-19 15:16:40 +04:00
int enable_quota = 0 ;
2008-07-26 22:34:21 +04:00
ext4_group_t g ;
2009-01-06 06:46:26 +03:00
unsigned int journal_ioprio = DEFAULT_JOURNAL_IOPRIO ;
2011-05-25 02:31:25 +04:00
int err = 0 ;
2006-10-11 12:20:50 +04:00
# ifdef CONFIG_QUOTA
2013-01-25 08:24:58 +04:00
int i , j ;
2006-10-11 12:20:50 +04:00
# endif
2010-05-16 20:00:00 +04:00
char * orig_data = kstrdup ( data , GFP_KERNEL ) ;
2006-10-11 12:20:50 +04:00
/* Store the original options */
old_sb_flags = sb - > s_flags ;
old_opts . s_mount_opt = sbi - > s_mount_opt ;
2010-12-16 04:30:48 +03:00
old_opts . s_mount_opt2 = sbi - > s_mount_opt2 ;
2006-10-11 12:20:50 +04:00
old_opts . s_resuid = sbi - > s_resuid ;
old_opts . s_resgid = sbi - > s_resgid ;
old_opts . s_commit_interval = sbi - > s_commit_interval ;
2009-01-04 04:27:38 +03:00
old_opts . s_min_batch_time = sbi - > s_min_batch_time ;
old_opts . s_max_batch_time = sbi - > s_max_batch_time ;
2006-10-11 12:20:50 +04:00
# ifdef CONFIG_QUOTA
old_opts . s_jquota_fmt = sbi - > s_jquota_fmt ;
2014-09-11 19:15:15 +04:00
for ( i = 0 ; i < EXT4_MAXQUOTAS ; i + + )
2013-01-25 08:24:58 +04:00
if ( sbi - > s_qf_names [ i ] ) {
old_opts . s_qf_names [ i ] = kstrdup ( sbi - > s_qf_names [ i ] ,
GFP_KERNEL ) ;
if ( ! old_opts . s_qf_names [ i ] ) {
for ( j = 0 ; j < i ; j + + )
kfree ( old_opts . s_qf_names [ j ] ) ;
2013-03-03 02:13:55 +04:00
kfree ( orig_data ) ;
2013-01-25 08:24:58 +04:00
return - ENOMEM ;
}
} else
old_opts . s_qf_names [ i ] = NULL ;
2006-10-11 12:20:50 +04:00
# endif
2009-01-06 06:46:26 +03:00
if ( sbi - > s_journal & & sbi - > s_journal - > j_task - > io_context )
journal_ioprio = sbi - > s_journal - > j_task - > io_context - > ioprio ;
2006-10-11 12:20:50 +04:00
2012-02-21 02:53:04 +04:00
if ( ! parse_options ( data , sb , NULL , & journal_ioprio , 1 ) ) {
2006-10-11 12:20:50 +04:00
err = - EINVAL ;
goto restore_opts ;
}
2014-10-30 17:53:16 +03:00
if ( ( old_opts . s_mount_opt & EXT4_MOUNT_JOURNAL_CHECKSUM ) ^
2014-11-26 00:20:50 +03:00
test_opt ( sb , JOURNAL_CHECKSUM ) ) {
ext4_msg ( sb , KERN_ERR , " changing journal_checksum "
2015-02-13 07:07:37 +03:00
" during remount not supported; ignoring " ) ;
sbi - > s_mount_opt ^ = EXT4_MOUNT_JOURNAL_CHECKSUM ;
2014-10-30 17:53:16 +03:00
}
2013-08-09 07:02:24 +04:00
if ( test_opt ( sb , DATA_FLAGS ) = = EXT4_MOUNT_JOURNAL_DATA ) {
if ( test_opt2 ( sb , EXPLICIT_DELALLOC ) ) {
ext4_msg ( sb , KERN_ERR , " can't mount with "
" both data=journal and delalloc " ) ;
err = - EINVAL ;
goto restore_opts ;
}
if ( test_opt ( sb , DIOREAD_NOLOCK ) ) {
ext4_msg ( sb , KERN_ERR , " can't mount with "
" both data=journal and dioread_nolock " ) ;
err = - EINVAL ;
goto restore_opts ;
}
2015-02-17 02:59:38 +03:00
if ( test_opt ( sb , DAX ) ) {
ext4_msg ( sb , KERN_ERR , " can't mount with "
" both data=journal and dax " ) ;
err = - EINVAL ;
goto restore_opts ;
}
2016-12-04 00:20:53 +03:00
} else if ( test_opt ( sb , DATA_FLAGS ) = = EXT4_MOUNT_ORDERED_DATA ) {
if ( test_opt ( sb , JOURNAL_ASYNC_COMMIT ) ) {
ext4_msg ( sb , KERN_ERR , " can't mount with "
" journal_async_commit in data=ordered mode " ) ;
err = - EINVAL ;
goto restore_opts ;
}
2015-02-17 02:59:38 +03:00
}
2017-06-22 18:55:14 +03:00
if ( ( sbi - > s_mount_opt ^ old_opts . s_mount_opt ) & EXT4_MOUNT_NO_MBCACHE ) {
ext4_msg ( sb , KERN_ERR , " can't enable nombcache during remount " ) ;
err = - EINVAL ;
goto restore_opts ;
}
2015-02-17 02:59:38 +03:00
if ( ( sbi - > s_mount_opt ^ old_opts . s_mount_opt ) & EXT4_MOUNT_DAX ) {
ext4_msg ( sb , KERN_WARNING , " warning: refusing change of "
" dax flag with busy inodes while remounting " ) ;
sbi - > s_mount_opt ^ = EXT4_MOUNT_DAX ;
2013-08-09 07:02:24 +04:00
}
2009-06-13 18:09:36 +04:00
if ( sbi - > s_mount_flags & EXT4_MF_FS_ABORTED )
2010-06-29 19:07:07 +04:00
ext4_abort ( sb , " Abort forced by user " ) ;
2006-10-11 12:20:50 +04:00
sb - > s_flags = ( sb - > s_flags & ~ MS_POSIXACL ) |
2010-02-24 19:35:32 +03:00
( test_opt ( sb , POSIX_ACL ) ? MS_POSIXACL : 0 ) ;
2006-10-11 12:20:50 +04:00
es = sbi - > s_es ;
2009-01-06 06:46:26 +03:00
if ( sbi - > s_journal ) {
2009-01-07 08:06:22 +03:00
ext4_init_journal_params ( sb , sbi - > s_journal ) ;
2009-01-06 06:46:26 +03:00
set_task_ioprio ( sbi - > s_journal - > j_task , journal_ioprio ) ;
}
2006-10-11 12:20:50 +04:00
2015-06-23 18:03:54 +03:00
if ( * flags & MS_LAZYTIME )
sb - > s_flags | = MS_LAZYTIME ;
2012-02-21 02:53:04 +04:00
if ( ( * flags & MS_RDONLY ) ! = ( sb - > s_flags & MS_RDONLY ) ) {
2009-06-13 18:09:36 +04:00
if ( sbi - > s_mount_flags & EXT4_MF_FS_ABORTED ) {
2006-10-11 12:20:50 +04:00
err = - EROFS ;
goto restore_opts ;
}
if ( * flags & MS_RDONLY ) {
2014-03-14 06:49:42 +04:00
err = sync_filesystem ( sb ) ;
if ( err < 0 )
goto restore_opts ;
2010-05-19 15:16:41 +04:00
err = dquot_suspend ( sb , - 1 ) ;
if ( err < 0 )
2010-05-19 15:16:40 +04:00
goto restore_opts ;
2006-10-11 12:20:50 +04:00
/*
* First of all , the unconditional stuff we have to do
* to disable replay of the journal when we next remount
*/
sb - > s_flags | = MS_RDONLY ;
/*
* OK , test if we are remounting a valid rw partition
* readonly , and if so set the rdonly flag and then
* mark the partition as valid again .
*/
2006-10-11 12:20:53 +04:00
if ( ! ( es - > s_state & cpu_to_le16 ( EXT4_VALID_FS ) ) & &
( sbi - > s_mount_state & EXT4_VALID_FS ) )
2006-10-11 12:20:50 +04:00
es - > s_state = cpu_to_le16 ( sbi - > s_mount_state ) ;
2009-05-01 09:59:42 +04:00
if ( sbi - > s_journal )
2009-01-07 08:06:22 +03:00
ext4_mark_recovery_complete ( sb , es ) ;
2006-10-11 12:20:50 +04:00
} else {
2009-08-18 08:20:23 +04:00
/* Make sure we can mount this feature set readwrite */
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_readonly ( sb ) | |
2015-02-13 06:31:21 +03:00
! ext4_feature_set_ok ( sb , 0 ) ) {
2006-10-11 12:20:50 +04:00
err = - EROFS ;
goto restore_opts ;
}
2008-07-26 22:34:21 +04:00
/*
* Make sure the group descriptor checksums
2009-06-04 01:59:28 +04:00
* are sane . If they aren ' t , refuse to remount r / w .
2008-07-26 22:34:21 +04:00
*/
for ( g = 0 ; g < sbi - > s_groups_count ; g + + ) {
struct ext4_group_desc * gdp =
ext4_get_group_desc ( sb , g , NULL ) ;
2012-04-30 02:45:10 +04:00
if ( ! ext4_group_desc_csum_verify ( sb , g , gdp ) ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_ERR ,
" ext4_remount: Checksum for group %u failed (%u!=%u) " ,
2015-10-17 23:18:43 +03:00
g , le16_to_cpu ( ext4_group_desc_csum ( sb , g , gdp ) ) ,
2008-07-26 22:34:21 +04:00
le16_to_cpu ( gdp - > bg_checksum ) ) ;
2015-10-17 23:16:04 +03:00
err = - EFSBADCRC ;
2008-07-26 22:34:21 +04:00
goto restore_opts ;
}
}
2007-02-10 12:46:08 +03:00
/*
* If we have an unprocessed orphan list hanging
* around from a previously readonly bdev mount ,
* require a full umount / remount for now .
*/
if ( es - > s_last_orphan ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_WARNING , " Couldn't "
2007-02-10 12:46:08 +03:00
" remount RDWR because of unprocessed "
" orphan inode list. Please "
2009-06-05 01:36:36 +04:00
" umount/remount instead " ) ;
2007-02-10 12:46:08 +03:00
err = - EINVAL ;
goto restore_opts ;
}
2006-10-11 12:20:50 +04:00
/*
* Mounting a RDONLY partition read - write , so reread
* and store the current valid flag . ( It may have
* been changed by e2fsck since we originally mounted
* the partition . )
*/
2009-01-07 08:06:22 +03:00
if ( sbi - > s_journal )
ext4_clear_journal_err ( sb , es ) ;
2006-10-11 12:20:50 +04:00
sbi - > s_mount_state = le16_to_cpu ( es - > s_state ) ;
2008-07-27 00:15:44 +04:00
if ( ! ext4_setup_super ( sb , es , 0 ) )
2006-10-11 12:20:50 +04:00
sb - > s_flags & = ~ MS_RDONLY ;
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_mmp ( sb ) )
2011-05-25 02:31:25 +04:00
if ( ext4_multi_mount_protect ( sb ,
le64_to_cpu ( es - > s_mmp_block ) ) ) {
err = - EROFS ;
goto restore_opts ;
}
2010-05-19 15:16:40 +04:00
enable_quota = 1 ;
2006-10-11 12:20:50 +04:00
}
}
ext4: add support for lazy inode table initialization
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out. The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, have not been corrupted. However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the old, uninitialized data could potentially cause e2fsck to
report false problems.
Hence, it is important for the inode tables to be initialized as soon
as possible. This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.
This is done via a new kernel thread called ext4lazyinit, which is
created on demand and destroyed when it is no longer needed. There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with the init_itable mount option is mounted, the
ext4lazyinit thread is created, then the filesystem can register its request in the
request list.
This thread then walks through the list of requests picking up
scheduled requests and invoking ext4_init_inode_table(). Next schedule
time for the request is computed by multiplying the time it took to
zero out last inode table with wait multiplier, which can be set with
the (init_itable=n) mount option (default is 10). We are doing
this so we do not take the whole I/O bandwidth. When the thread is no
longer necessary (request list is empty) it frees the appropriate
structures and exits (and can be created later by another
filesystem).
We do not disturb regular inode allocations in any way; the allocator does
not care whether the inode table is zeroed or not. But when zeroing, we
have to skip used inodes, obviously. We should also prevent new inode
allocations from the group while zeroing is under way. For that we
take write alloc_sem lock in ext4_init_inode_table() and read alloc_sem
in the ext4_claim_inode, so when we are unlucky and allocator hits the
group which is currently being zeroed, it just has to wait.
This can be suppressed using the mount option no_init_itable.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
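The throttling rule described in the commit message above (next run time = time taken by the last zeroing pass multiplied by the wait multiplier) can be sketched as follows. This is a user-space illustration under stated assumptions: `next_schedule_time()` and `SKETCH_DEFAULT_WAIT_MULT` are hypothetical names, and the tick unit is arbitrary. Only the default multiplier of 10 comes from the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Default multiplier named in the commit message for init_itable=n. */
#define SKETCH_DEFAULT_WAIT_MULT 10u

/*
 * Given the current time and how long the last inode-table zeroing took
 * (both in the same hypothetical tick unit), return the time at which the
 * next zeroing request should run. A slow device therefore gets a
 * proportionally longer pause, so zeroing does not consume all I/O bandwidth.
 */
static uint64_t next_schedule_time(uint64_t now, uint64_t zeroing_ticks,
				   unsigned int wait_mult)
{
	/* Guard the multiplication against overflow in this sketch. */
	assert(zeroing_ticks == 0 || wait_mult <= UINT64_MAX / zeroing_ticks);

	return now + zeroing_ticks * wait_mult;
}
```

For example, if the last pass took 5 ticks and the multiplier is 10, the next pass is scheduled 50 ticks from now.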
2010-10-28 05:30:05 +04:00
/*
* Reinitialize lazy itable initialization thread based on
* current settings
*/
if ( ( sb - > s_flags & MS_RDONLY ) | | ! test_opt ( sb , INIT_INODE_TABLE ) )
ext4_unregister_li_request ( sb ) ;
else {
ext4_group_t first_not_zeroed ;
first_not_zeroed = ext4_has_uninit_itable ( sb ) ;
ext4_register_li_request ( sb , first_not_zeroed ) ;
}
2009-05-17 23:38:01 +04:00
ext4_setup_system_zone ( sb ) ;
2012-12-25 23:08:16 +04:00
if ( sbi - > s_journal = = NULL & & ! ( old_sb_flags & MS_RDONLY ) )
2009-05-01 08:33:44 +04:00
ext4_commit_super ( sb , 1 ) ;
2009-01-07 08:06:22 +03:00
2006-10-11 12:20:50 +04:00
# ifdef CONFIG_QUOTA
/* Release old quota file names */
2014-09-11 19:15:15 +04:00
for ( i = 0 ; i < EXT4_MAXQUOTAS ; i + + )
2013-01-25 08:24:58 +04:00
kfree ( old_opts . s_qf_names [ i ] ) ;
ext4: make quota as first class supported feature
This patch adds support for quotas as a first class feature in ext4;
which is to say, the quota files are stored in hidden inodes as file
system metadata, instead of as separate files visible in the file system
directory hierarchy.
It is based on the proposal at:
https://ext4.wiki.kernel.org/index.php/Design_For_1st_Class_Quota_in_Ext4
This patch introduces a new feature - EXT4_FEATURE_RO_COMPAT_QUOTA
which, when turned on, enables quota accounting at mount time
itself. Also, the quota inodes are stored in two additional superblock
fields. Some changes introduced by this patch that should be pointed
out are:
1) Two new ext4-superblock fields - s_usr_quota_inum and
s_grp_quota_inum for storing the quota inodes in use.
2) Default quota inodes are: inode#3 for tracking userquota and inode#4
for tracking group quota. The superblock fields can be set to use
other inodes as well.
3) If the QUOTA feature and corresponding quota inodes are set in
superblock, the quota usage tracking is turned on at mount time. On
'quotaon' ioctl, the quota limits enforcement is turned
on. 'quotaoff' ioctl turns off only the limits enforcement in this
case.
4) When QUOTA feature is in use, the quota mount options 'quota',
'usrquota', 'grpquota' are ignored by the kernel.
5) mke2fs or tune2fs can be used to set the QUOTA feature and initialize
quota inodes. The default reserved inodes will not be visible to user
as regular files.
6) The quota-tools will need to be modified to support hidden quota
files on ext4. E2fsprogs will also include support for creating and
fixing quota files.
7) Support is only for the new V2 quota file format.
Tested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johann Lombardi <johann@whamcloud.com>
Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-07-23 04:21:31 +04:00
if ( enable_quota ) {
if ( sb_any_quota_suspended ( sb ) )
dquot_resume ( sb , - 1 ) ;
2015-10-17 23:18:43 +03:00
else if ( ext4_has_feature_quota ( sb ) ) {
2012-07-23 04:21:31 +04:00
err = ext4_enable_quotas ( sb ) ;
2012-08-18 03:08:42 +04:00
if ( err )
2012-07-23 04:21:31 +04:00
goto restore_opts ;
}
}
2006-10-11 12:20:50 +04:00
# endif
2010-05-16 20:00:00 +04:00
2015-02-02 08:37:02 +03:00
* flags = ( * flags & ~ MS_LAZYTIME ) | ( sb - > s_flags & MS_LAZYTIME ) ;
2010-05-16 20:00:00 +04:00
ext4_msg ( sb , KERN_INFO , " re-mounted. Opts: %s " , orig_data ) ;
kfree ( orig_data ) ;
2006-10-11 12:20:50 +04:00
return 0 ;
2009-06-04 01:59:28 +04:00
2006-10-11 12:20:50 +04:00
restore_opts :
sb - > s_flags = old_sb_flags ;
sbi - > s_mount_opt = old_opts . s_mount_opt ;
2010-12-16 04:30:48 +03:00
sbi - > s_mount_opt2 = old_opts . s_mount_opt2 ;
2006-10-11 12:20:50 +04:00
sbi - > s_resuid = old_opts . s_resuid ;
sbi - > s_resgid = old_opts . s_resgid ;
sbi - > s_commit_interval = old_opts . s_commit_interval ;
2009-01-04 04:27:38 +03:00
sbi - > s_min_batch_time = old_opts . s_min_batch_time ;
sbi - > s_max_batch_time = old_opts . s_max_batch_time ;
2006-10-11 12:20:50 +04:00
# ifdef CONFIG_QUOTA
sbi - > s_jquota_fmt = old_opts . s_jquota_fmt ;
2014-09-11 19:15:15 +04:00
for ( i = 0 ; i < EXT4_MAXQUOTAS ; i + + ) {
2013-01-25 08:24:58 +04:00
kfree ( sbi - > s_qf_names [ i ] ) ;
2006-10-11 12:20:50 +04:00
sbi - > s_qf_names [ i ] = old_opts . s_qf_names [ i ] ;
}
# endif
2010-05-16 20:00:00 +04:00
kfree ( orig_data ) ;
2006-10-11 12:20:50 +04:00
return err ;
}
2016-01-09 00:01:22 +03:00
# ifdef CONFIG_QUOTA
static int ext4_statfs_project ( struct super_block * sb ,
kprojid_t projid , struct kstatfs * buf )
{
struct kqid qid ;
struct dquot * dquot ;
u64 limit ;
u64 curblock ;
qid = make_kqid_projid ( projid ) ;
dquot = dqget ( sb , qid ) ;
if ( IS_ERR ( dquot ) )
return PTR_ERR ( dquot ) ;
spin_lock ( & dq_data_lock ) ;
limit = ( dquot - > dq_dqb . dqb_bsoftlimit ?
dquot - > dq_dqb . dqb_bsoftlimit :
dquot - > dq_dqb . dqb_bhardlimit ) > > sb - > s_blocksize_bits ;
if ( limit & & buf - > f_blocks > limit ) {
curblock = dquot - > dq_dqb . dqb_curspace > > sb - > s_blocksize_bits ;
buf - > f_blocks = limit ;
buf - > f_bfree = buf - > f_bavail =
( buf - > f_blocks > curblock ) ?
( buf - > f_blocks - curblock ) : 0 ;
}
limit = dquot - > dq_dqb . dqb_isoftlimit ?
dquot - > dq_dqb . dqb_isoftlimit :
dquot - > dq_dqb . dqb_ihardlimit ;
if ( limit & & buf - > f_files > limit ) {
buf - > f_files = limit ;
buf - > f_ffree =
( buf - > f_files > dquot - > dq_dqb . dqb_curinodes ) ?
( buf - > f_files - dquot - > dq_dqb . dqb_curinodes ) : 0 ;
}
spin_unlock ( & dq_data_lock ) ;
dqput ( dquot ) ;
return 0 ;
}
# endif
2008-07-27 00:15:44 +04:00
static int ext4_statfs ( struct dentry * dentry , struct kstatfs * buf )
2006-10-11 12:20:50 +04:00
{
struct super_block * sb = dentry - > d_sb ;
2006-10-11 12:20:53 +04:00
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
struct ext4_super_block * es = sbi - > s_es ;
2013-04-10 06:11:22 +04:00
ext4_fsblk_t overhead = 0 , resv_blocks ;
2006-12-07 07:35:29 +03:00
u64 fsid ;
2011-05-25 02:30:07 +04:00
s64 bfree ;
2013-04-10 06:11:22 +04:00
resv_blocks = EXT4_C2B ( sbi , atomic64_read ( & sbi - > s_resv_clusters ) ) ;
2006-10-11 12:20:50 +04:00
2012-07-10 00:27:05 +04:00
if ( ! test_opt ( sb , MINIX_DF ) )
overhead = sbi - > s_overhead ;
2006-10-11 12:20:50 +04:00
2006-10-11 12:20:53 +04:00
buf - > f_type = EXT4_SUPER_MAGIC ;
2006-10-11 12:20:50 +04:00
buf - > f_bsize = sb - > s_blocksize ;
2012-11-08 19:33:36 +04:00
buf - > f_blocks = ext4_blocks_count ( es ) - EXT4_C2B ( sbi , overhead ) ;
2011-09-10 02:56:51 +04:00
bfree = percpu_counter_sum_positive ( & sbi - > s_freeclusters_counter ) -
percpu_counter_sum_positive ( & sbi - > s_dirtyclusters_counter ) ;
2011-05-25 02:30:07 +04:00
/* prevent underflow in case little free space is available */
2011-09-10 02:56:51 +04:00
buf - > f_bfree = EXT4_C2B ( sbi , max_t ( s64 , bfree , 0 ) ) ;
2013-04-10 06:11:22 +04:00
buf - > f_bavail = buf - > f_bfree -
( ext4_r_blocks_count ( es ) + resv_blocks ) ;
if ( buf - > f_bfree < ( ext4_r_blocks_count ( es ) + resv_blocks ) )
2006-10-11 12:20:50 +04:00
buf - > f_bavail = 0 ;
buf - > f_files = le32_to_cpu ( es - > s_inodes_count ) ;
2007-10-17 10:25:44 +04:00
buf - > f_ffree = percpu_counter_sum_positive ( & sbi - > s_freeinodes_counter ) ;
2006-10-11 12:20:53 +04:00
buf - > f_namelen = EXT4_NAME_LEN ;
2006-12-07 07:35:29 +03:00
fsid = le64_to_cpup ( ( void * ) es - > s_uuid ) ^
le64_to_cpup ( ( void * ) es - > s_uuid + sizeof ( u64 ) ) ;
buf - > f_fsid . val [ 0 ] = fsid & 0xFFFFFFFFUL ;
buf - > f_fsid . val [ 1 ] = ( fsid > > 32 ) & 0xFFFFFFFFUL ;
2009-06-04 01:59:28 +04:00
2016-01-09 00:01:22 +03:00
# ifdef CONFIG_QUOTA
if ( ext4_test_inode_flag ( dentry - > d_inode , EXT4_INODE_PROJINHERIT ) & &
sb_has_quota_limits_enabled ( sb , PRJQUOTA ) )
ext4_statfs_project ( sb , EXT4_I ( dentry - > d_inode ) - > i_projid , buf ) ;
# endif
2006-10-11 12:20:50 +04:00
return 0 ;
}
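The two computations at the end of ext4_statfs ( ) can be tried out in userspace. The following is a minimal sketch, not the kernel code: the helper names sketch_bavail and sketch_fsid are hypothetical, the reserved count stands in for ext4_r_blocks_count ( es ) + resv_blocks, and the byte loop is assumed to be equivalent to le64_to_cpup ( ) on each half of s_uuid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: blocks available to unprivileged users.
 * Mirrors the f_bavail logic: subtract the reserved blocks, but
 * report 0 instead of letting the unsigned subtraction wrap.
 */
static uint64_t sketch_bavail(uint64_t bfree, uint64_t reserved)
{
	return bfree < reserved ? 0 : bfree - reserved;
}

/*
 * Hypothetical helper: fold a 16-byte UUID into the two 32-bit
 * words of f_fsid by XOR-ing its two little-endian 64-bit halves.
 */
static void sketch_fsid(const unsigned char uuid[16], uint32_t val[2])
{
	uint64_t lo = 0, hi = 0, fsid;
	int i;

	assert(uuid != NULL && val != NULL);

	/* Assemble each half byte by byte, least significant byte first. */
	for (i = 7; i >= 0; i--) {
		lo = (lo << 8) | uuid[i];
		hi = (hi << 8) | uuid[8 + i];
	}

	fsid = lo ^ hi;
	val[0] = (uint32_t)(fsid & 0xFFFFFFFFUL);
	val[1] = (uint32_t)(fsid >> 32);
}
```

Calling sketch_bavail ( 3 , 4 ) yields 0 rather than a huge wrapped value, which is the case the explicit f_bavail = 0 assignment guards against.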
2009-06-04 01:59:28 +04:00
/* Helper function for writing quotas on sync - we need to start transaction
* before quota file is locked for write . Otherwise there are possible deadlocks :
2006-10-11 12:20:50 +04:00
* Process 1 Process 2
2006-10-11 12:20:53 +04:00
* ext4_create ( ) quota_sync ( )
2009-01-26 19:04:39 +03:00
* jbd2_journal_start ( ) write_dquot ( )
2010-03-03 17:05:07 +03:00
* dquot_initialize ( ) down ( dqio_mutex )
2006-10-11 12:21:01 +04:00
* down ( dqio_mutex ) jbd2_journal_start ( )
2006-10-11 12:20:50 +04:00
*
*/
# ifdef CONFIG_QUOTA
static inline struct inode * dquot_to_inode ( struct dquot * dquot )
{
2012-09-16 14:56:19 +04:00
return sb_dqopt ( dquot - > dq_sb ) - > files [ dquot - > dq_id . type ] ;
2006-10-11 12:20:50 +04:00
}
2006-10-11 12:20:53 +04:00
static int ext4_write_dquot ( struct dquot * dquot )
2006-10-11 12:20:50 +04:00
{
int ret , err ;
handle_t * handle ;
struct inode * inode ;
inode = dquot_to_inode ( dquot ) ;
2013-02-09 06:59:22 +04:00
handle = ext4_journal_start ( inode , EXT4_HT_QUOTA ,
2009-06-04 01:59:28 +04:00
EXT4_QUOTA_TRANS_BLOCKS ( dquot - > dq_sb ) ) ;
2006-10-11 12:20:50 +04:00
if ( IS_ERR ( handle ) )
return PTR_ERR ( handle ) ;
ret = dquot_commit ( dquot ) ;
2006-10-11 12:20:53 +04:00
err = ext4_journal_stop ( handle ) ;
2006-10-11 12:20:50 +04:00
if ( ! ret )
ret = err ;
return ret ;
}
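ext4_write_dquot ( ) and the helpers that follow share one error-handling shape: run the quota operation inside a journal handle, always stop the handle, and report the operation's error in preference to the error from stopping. A minimal userspace sketch of that shape, assuming hypothetical names throughout (sketch_run_bracketed and the sketch_* callbacks are stand-ins, not kernel APIs):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback type standing in for the operation and the "stop" step. */
typedef int (*sketch_op_fn)(void *ctx);

/* Sample callbacks used to exercise the wrapper. */
static int sketch_ok(void *ctx)         { (void)ctx; return 0; }
static int sketch_fail_io(void *ctx)    { (void)ctx; return -EIO; }
static int sketch_fail_nospc(void *ctx) { (void)ctx; return -ENOSPC; }

/*
 * Run op(), then always run stop(). The first error wins: the
 * result of stop() is only reported when op() itself succeeded.
 */
static int sketch_run_bracketed(sketch_op_fn op, sketch_op_fn stop, void *ctx)
{
	int ret, err;

	assert(op != NULL && stop != NULL);

	ret = op(ctx);
	err = stop(ctx);	/* cleanup runs even if op() failed */
	if (!ret)
		ret = err;
	return ret;
}
```

With a failing operation and a failing stop step, for example sketch_run_bracketed ( sketch_fail_io , sketch_fail_nospc , NULL ), the caller sees -EIO, so the original cause is not masked by the cleanup error.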
2006-10-11 12:20:53 +04:00
static int ext4_acquire_dquot ( struct dquot * dquot )
2006-10-11 12:20:50 +04:00
{
int ret , err ;
handle_t * handle ;
2013-02-09 06:59:22 +04:00
handle = ext4_journal_start ( dquot_to_inode ( dquot ) , EXT4_HT_QUOTA ,
2009-06-04 01:59:28 +04:00
EXT4_QUOTA_INIT_BLOCKS ( dquot - > dq_sb ) ) ;
2006-10-11 12:20:50 +04:00
if ( IS_ERR ( handle ) )
return PTR_ERR ( handle ) ;
ret = dquot_acquire ( dquot ) ;
2006-10-11 12:20:53 +04:00
err = ext4_journal_stop ( handle ) ;
2006-10-11 12:20:50 +04:00
if ( ! ret )
ret = err ;
return ret ;
}
2006-10-11 12:20:53 +04:00
static int ext4_release_dquot ( struct dquot * dquot )
2006-10-11 12:20:50 +04:00
{
int ret , err ;
handle_t * handle ;
2013-02-09 06:59:22 +04:00
handle = ext4_journal_start ( dquot_to_inode ( dquot ) , EXT4_HT_QUOTA ,
2009-06-04 01:59:28 +04:00
EXT4_QUOTA_DEL_BLOCKS ( dquot - > dq_sb ) ) ;
2007-09-12 02:23:29 +04:00
if ( IS_ERR ( handle ) ) {
/* Release dquot anyway to avoid endless cycle in dqput() */
dquot_release ( dquot ) ;
2006-10-11 12:20:50 +04:00
return PTR_ERR ( handle ) ;
2007-09-12 02:23:29 +04:00
}
2006-10-11 12:20:50 +04:00
ret = dquot_release ( dquot ) ;
2006-10-11 12:20:53 +04:00
err = ext4_journal_stop ( handle ) ;
2006-10-11 12:20:50 +04:00
if ( ! ret )
ret = err ;
return ret ;
}
2006-10-11 12:20:53 +04:00
static int ext4_mark_dquot_dirty ( struct dquot * dquot )
2006-10-11 12:20:50 +04:00
{
2013-03-03 02:57:08 +04:00
struct super_block * sb = dquot - > dq_sb ;
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
2008-05-14 05:27:55 +04:00
/* Are we journaling quotas? */
2015-10-17 23:18:43 +03:00
if ( ext4_has_feature_quota ( sb ) | |
2013-03-03 02:57:08 +04:00
sbi - > s_qf_names [ USRQUOTA ] | | sbi - > s_qf_names [ GRPQUOTA ] ) {
2006-10-11 12:20:50 +04:00
dquot_mark_dquot_dirty ( dquot ) ;
2006-10-11 12:20:53 +04:00
return ext4_write_dquot ( dquot ) ;
2006-10-11 12:20:50 +04:00
} else {
return dquot_mark_dquot_dirty ( dquot ) ;
}
}
2006-10-11 12:20:53 +04:00
static int ext4_write_info ( struct super_block * sb , int type )
2006-10-11 12:20:50 +04:00
{
int ret , err ;
handle_t * handle ;
/* Data block + inode block */
2015-03-18 01:25:59 +03:00
handle = ext4_journal_start ( d_inode ( sb - > s_root ) , EXT4_HT_QUOTA , 2 ) ;
2006-10-11 12:20:50 +04:00
if ( IS_ERR ( handle ) )
return PTR_ERR ( handle ) ;
ret = dquot_commit_info ( sb , type ) ;
2006-10-11 12:20:53 +04:00
err = ext4_journal_stop ( handle ) ;
2006-10-11 12:20:50 +04:00
if ( ! ret )
ret = err ;
return ret ;
}
/*
* Turn on quotas during mount time - we need to find
* the quota file and such . . .
*/
2006-10-11 12:20:53 +04:00
static int ext4_quota_on_mount ( struct super_block * sb , int type )
2006-10-11 12:20:50 +04:00
{
2010-05-19 15:16:45 +04:00
return dquot_quota_on_mount ( sb , EXT4_SB ( sb ) - > s_qf_names [ type ] ,
EXT4_SB ( sb ) - > s_jquota_fmt , type ) ;
2006-10-11 12:20:50 +04:00
}
2016-04-01 08:31:28 +03:00
static void lockdep_set_quota_inode ( struct inode * inode , int subclass )
{
struct ext4_inode_info * ei = EXT4_I ( inode ) ;
/* The first argument of lockdep_set_subclass has to be
* * exactly * the same as the argument to init_rwsem ( ) - - - in
* this case , in init_once ( ) - - - or lockdep gets unhappy
* because the name of the lock is set using the
* stringification of the argument to init_rwsem ( ) .
*/
( void ) ei ; /* shut up clang warning if !CONFIG_LOCKDEP */
lockdep_set_subclass ( & ei - > i_data_sem , subclass ) ;
}
2006-10-11 12:20:50 +04:00
/*
* Standard function to be called on quota_on
*/
2006-10-11 12:20:53 +04:00
static int ext4_quota_on ( struct super_block * sb , int type , int format_id ,
2016-11-21 03:49:34 +03:00
const struct path * path )
2006-10-11 12:20:50 +04:00
{
int err ;
if ( ! test_opt ( sb , QUOTA ) )
return - EINVAL ;
2008-05-14 03:11:51 +04:00
2006-10-11 12:20:50 +04:00
/* Quotafile not on the same filesystem? */
2011-12-08 03:16:57 +04:00
if ( path - > dentry - > d_sb ! = sb )
2006-10-11 12:20:50 +04:00
return - EXDEV ;
2008-05-14 03:11:51 +04:00
/* Journaling quota? */
if ( EXT4_SB ( sb ) - > s_qf_names [ type ] ) {
2008-07-27 00:15:44 +04:00
/* Quotafile not in fs root? */
2010-09-15 19:38:58 +04:00
if ( path - > dentry - > d_parent ! = sb - > s_root )
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_WARNING ,
" Quota file not on filesystem root. "
" Journaled quota will not work " ) ;
2008-07-27 00:15:44 +04:00
}
2008-05-14 03:11:51 +04:00
/*
* When we journal data on quota file , we have to flush journal to see
* all updates to the file when we bypass pagecache . . .
*/
2009-01-07 08:06:22 +03:00
if ( EXT4_SB ( sb ) - > s_journal & &
2015-03-18 01:25:59 +03:00
ext4_should_journal_data ( d_inode ( path - > dentry ) ) ) {
2008-05-14 03:11:51 +04:00
/*
* We don ' t need to lock updates but journal_flush ( ) could
* otherwise be livelocked . . .
*/
jbd2_journal_lock_updates ( EXT4_SB ( sb ) - > s_journal ) ;
2008-10-11 04:29:21 +04:00
err = jbd2_journal_flush ( EXT4_SB ( sb ) - > s_journal ) ;
2008-05-14 03:11:51 +04:00
jbd2_journal_unlock_updates ( EXT4_SB ( sb ) - > s_journal ) ;
2010-09-15 19:38:58 +04:00
if ( err )
2008-10-11 04:29:21 +04:00
return err ;
2008-05-14 03:11:51 +04:00
}
2017-04-06 16:40:06 +03:00
2016-04-01 08:31:28 +03:00
lockdep_set_quota_inode ( path - > dentry - > d_inode , I_DATA_SEM_QUOTA ) ;
err = dquot_quota_on ( sb , type , format_id , path ) ;
2017-04-06 16:40:06 +03:00
if ( err ) {
2016-04-01 08:31:28 +03:00
lockdep_set_quota_inode ( path - > dentry - > d_inode ,
I_DATA_SEM_NORMAL ) ;
2017-04-06 16:40:06 +03:00
} else {
struct inode * inode = d_inode ( path - > dentry ) ;
handle_t * handle ;
2017-04-24 17:49:16 +03:00
/*
* Set inode flags to prevent userspace from messing with quota
* files . If this fails , we return success anyway since quotas
* are already enabled and this is not a hard failure .
*/
2017-04-06 16:40:06 +03:00
inode_lock ( inode ) ;
handle = ext4_journal_start ( inode , EXT4_HT_QUOTA , 1 ) ;
if ( IS_ERR ( handle ) )
goto unlock_inode ;
EXT4_I ( inode ) - > i_flags | = EXT4_NOATIME_FL | EXT4_IMMUTABLE_FL ;
inode_set_flags ( inode , S_NOATIME | S_IMMUTABLE ,
S_NOATIME | S_IMMUTABLE ) ;
ext4_mark_inode_dirty ( handle , inode ) ;
ext4_journal_stop ( handle ) ;
unlock_inode :
inode_unlock ( inode ) ;
}
2016-04-01 08:31:28 +03:00
return err ;
2006-10-11 12:20:50 +04:00
}
ext4: make quota as first class supported feature
This patch adds support for quotas as a first class feature in ext4;
which is to say, the quota files are stored in hidden inodes as file
system metadata, instead of as separate files visible in the file system
directory hierarchy.
It is based on the proposal at:
https://ext4.wiki.kernel.org/index.php/Design_For_1st_Class_Quota_in_Ext4
This patch introduces a new feature - EXT4_FEATURE_RO_COMPAT_QUOTA
which, when turned on, enables quota accounting at mount time
itself. Also, the quota inodes are stored in two additional superblock
fields. Some changes introduced by this patch that should be pointed
out are:
1) Two new ext4-superblock fields - s_usr_quota_inum and
s_grp_quota_inum for storing the quota inodes in use.
2) Default quota inodes are: inode#3 for tracking userquota and inode#4
for tracking group quota. The superblock fields can be set to use
other inodes as well.
3) If the QUOTA feature and corresponding quota inodes are set in
superblock, the quota usage tracking is turned on at mount time. On
'quotaon' ioctl, the quota limits enforcement is turned
on. 'quotaoff' ioctl turns off only the limits enforcement in this
case.
4) When QUOTA feature is in use, the quota mount options 'quota',
'usrquota', 'grpquota' are ignored by the kernel.
5) mke2fs or tune2fs can be used to set the QUOTA feature and initialize
quota inodes. The default reserved inodes will not be visible to user
as regular files.
6) The quota-tools will need to be modified to support hidden quota
files on ext4. E2fsprogs will also include support for creating and
fixing quota files.
7) Support is only for the new V2 quota file format.
Tested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johann Lombardi <johann@whamcloud.com>
Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
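Point 3 of the commit message above describes two separate switches per quota type: usage tracking, turned on at mount time when the QUOTA feature and a quota inode are present, and limits enforcement, toggled by quotaon and quotaoff. A minimal userspace sketch of that behaviour follows. All names (sketch_quota, SKETCH_USAGE, SKETCH_LIMITS, and the three functions) are hypothetical, and the choice to make sketch_quotaon fail when usage tracking is off is an assumption of this sketch, not something the commit message states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits: usage accounting and limits enforcement. */
enum {
	SKETCH_USAGE  = 1 << 0,
	SKETCH_LIMITS = 1 << 1
};

/* Hypothetical per-type quota state: hidden inode number plus flags. */
struct sketch_quota {
	unsigned long inum;	/* 0 means "no quota inode configured" */
	unsigned int flags;
};

/* Mount time: enable usage tracking only if the feature and an inode are set. */
static void sketch_mount(struct sketch_quota *q, bool feature_quota)
{
	assert(q != NULL);
	if (feature_quota && q->inum)
		q->flags |= SKETCH_USAGE;
}

/* quotaon: turn on limits enforcement (assumed to require usage tracking). */
static int sketch_quotaon(struct sketch_quota *q)
{
	assert(q != NULL);
	if (!(q->flags & SKETCH_USAGE))
		return -1;
	q->flags |= SKETCH_LIMITS;
	return 0;
}

/* quotaoff: turn off limits enforcement only; usage tracking stays on. */
static void sketch_quotaoff(struct sketch_quota *q)
{
	assert(q != NULL);
	q->flags &= ~(unsigned int)SKETCH_LIMITS;
}
```

After sketch_mount, sketch_quotaon and sketch_quotaoff on a type with inode 3, the flags are back to SKETCH_USAGE alone, which matches the statement that quotaoff "turns off only the limits enforcement".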
2012-07-23 04:21:31 +04:00
static int ext4_quota_enable ( struct super_block * sb , int type , int format_id ,
unsigned int flags )
{
int err ;
struct inode * qf_inode ;
2014-09-11 19:15:15 +04:00
unsigned long qf_inums [ EXT4_MAXQUOTAS ] = {
2012-07-23 04:21:31 +04:00
le32_to_cpu ( EXT4_SB ( sb ) - > s_es - > s_usr_quota_inum ) ,
2016-01-09 00:01:22 +03:00
le32_to_cpu ( EXT4_SB ( sb ) - > s_es - > s_grp_quota_inum ) ,
le32_to_cpu ( EXT4_SB ( sb ) - > s_es - > s_prj_quota_inum )
2012-07-23 04:21:31 +04:00
} ;
2015-10-17 23:18:43 +03:00
BUG_ON ( ! ext4_has_feature_quota ( sb ) ) ;
2012-07-23 04:21:31 +04:00
if ( ! qf_inums [ type ] )
return - EPERM ;
qf_inode = ext4_iget ( sb , qf_inums [ type ] ) ;
if ( IS_ERR ( qf_inode ) ) {
ext4_error ( sb , " Bad quota inode # %lu " , qf_inums [ type ] ) ;
return PTR_ERR ( qf_inode ) ;
}
2013-04-09 17:21:41 +04:00
/* Don't account quota for quota files to avoid recursion */
qf_inode - > i_flags | = S_NOQUOTA ;
2016-04-01 08:31:28 +03:00
lockdep_set_quota_inode ( qf_inode , I_DATA_SEM_QUOTA ) ;
2012-07-23 04:21:31 +04:00
err = dquot_enable ( qf_inode , type , format_id , flags ) ;
iput ( qf_inode ) ;
2016-04-01 08:31:28 +03:00
if ( err )
lockdep_set_quota_inode ( qf_inode , I_DATA_SEM_NORMAL ) ;
2012-07-23 04:21:31 +04:00
return err ;
}
/* Enable usage tracking for all quota types. */
static int ext4_enable_quotas ( struct super_block * sb )
{
int type , err = 0 ;
2014-09-11 19:15:15 +04:00
unsigned long qf_inums [ EXT4_MAXQUOTAS ] = {
2012-07-23 04:21:31 +04:00
le32_to_cpu ( EXT4_SB ( sb ) - > s_es - > s_usr_quota_inum ) ,
2016-01-09 00:01:22 +03:00
le32_to_cpu ( EXT4_SB ( sb ) - > s_es - > s_grp_quota_inum ) ,
le32_to_cpu ( EXT4_SB ( sb ) - > s_es - > s_prj_quota_inum )
2012-07-23 04:21:31 +04:00
} ;
2016-09-06 06:08:16 +03:00
bool quota_mopt [ EXT4_MAXQUOTAS ] = {
test_opt ( sb , USRQUOTA ) ,
test_opt ( sb , GRPQUOTA ) ,
test_opt ( sb , PRJQUOTA ) ,
} ;
2012-07-23 04:21:31 +04:00
sb_dqopt ( sb ) - > flags | = DQUOT_QUOTA_SYS_FILE ;
2014-09-11 19:15:15 +04:00
for ( type = 0 ; type < EXT4_MAXQUOTAS ; type + + ) {
2012-07-23 04:21:31 +04:00
if ( qf_inums [ type ] ) {
err = ext4_quota_enable ( sb , type , QFMT_VFS_V1 ,
2016-09-06 06:08:16 +03:00
DQUOT_USAGE_ENABLED |
( quota_mopt [ type ] ? DQUOT_LIMITS_ENABLED : 0 ) ) ;
2012-07-23 04:21:31 +04:00
if ( err ) {
ext4_warning ( sb ,
2013-01-25 08:24:54 +04:00
" Failed to enable quota tracking "
" (type=%d, err=%d). Please run "
" e2fsck to fix. " , type , err ) ;
2012-07-23 04:21:31 +04:00
return err ;
}
}
}
return 0 ;
}
2010-08-02 01:48:36 +04:00
static int ext4_quota_off ( struct super_block * sb , int type )
{
2011-04-04 23:33:39 +04:00
struct inode * inode = sb_dqopt ( sb ) - > files [ type ] ;
handle_t * handle ;
2017-04-06 16:40:06 +03:00
int err ;
2011-04-04 23:33:39 +04:00
2010-11-08 21:47:33 +03:00
/* Force all delayed allocation blocks to be allocated.
* Caller already holds s_umount sem */
if ( test_opt ( sb , DELALLOC ) )
2010-08-02 01:48:36 +04:00
sync_filesystem ( sb ) ;
2017-04-06 16:40:06 +03:00
if ( ! inode | | ! igrab ( inode ) )
2011-05-16 17:59:13 +04:00
goto out ;
2017-04-06 16:40:06 +03:00
err = dquot_quota_off ( sb , type ) ;
2017-05-22 05:31:23 +03:00
if ( err | | ext4_has_feature_quota ( sb ) )
2017-04-06 16:40:06 +03:00
goto out_put ;
inode_lock ( inode ) ;
2017-04-24 17:49:16 +03:00
/*
* Update modification times of quota files when userspace can
* start looking at them . If we fail , we return success anyway since
* this is not a hard failure and quotas are already disabled .
*/
2013-02-09 06:59:22 +04:00
handle = ext4_journal_start ( inode , EXT4_HT_QUOTA , 1 ) ;
2011-04-04 23:33:39 +04:00
if ( IS_ERR ( handle ) )
2017-04-06 16:40:06 +03:00
goto out_unlock ;
EXT4_I ( inode ) - > i_flags & = ~ ( EXT4_NOATIME_FL | EXT4_IMMUTABLE_FL ) ;
inode_set_flags ( inode , 0 , S_NOATIME | S_IMMUTABLE ) ;
2016-11-15 05:40:10 +03:00
inode - > i_mtime = inode - > i_ctime = current_time ( inode ) ;
2011-04-04 23:33:39 +04:00
ext4_mark_inode_dirty ( handle , inode ) ;
ext4_journal_stop ( handle ) ;
2017-04-06 16:40:06 +03:00
out_unlock :
inode_unlock ( inode ) ;
out_put :
2017-05-22 05:31:23 +03:00
lockdep_set_quota_inode ( inode , I_DATA_SEM_NORMAL ) ;
2017-04-06 16:40:06 +03:00
iput ( inode ) ;
return err ;
2011-04-04 23:33:39 +04:00
out :
2010-08-02 01:48:36 +04:00
return dquot_quota_off ( sb , type ) ;
}
2006-10-11 12:20:50 +04:00
/* Read data from quotafile - avoid pagecache and such because we cannot afford
* acquiring the locks . . . As quota files are never truncated and quota code
2011-03-31 05:57:33 +04:00
* itself serializes the operations ( and no one else should touch the files )
2006-10-11 12:20:50 +04:00
* we don ' t have to be afraid of races */
2006-10-11 12:20:53 +04:00
static ssize_t ext4_quota_read ( struct super_block * sb , int type , char * data ,
2006-10-11 12:20:50 +04:00
size_t len , loff_t off )
{
struct inode * inode = sb_dqopt ( sb ) - > files [ type ] ;
2008-01-29 07:58:27 +03:00
ext4_lblk_t blk = off > > EXT4_BLOCK_SIZE_BITS ( sb ) ;
2006-10-11 12:20:50 +04:00
int offset = off & ( sb - > s_blocksize - 1 ) ;
int tocopy ;
size_t toread ;
struct buffer_head * bh ;
loff_t i_size = i_size_read ( inode ) ;
if ( off > i_size )
return 0 ;
if ( off + len > i_size )
len = i_size - off ;
toread = len ;
while ( toread > 0 ) {
tocopy = sb - > s_blocksize - offset < toread ?
sb - > s_blocksize - offset : toread ;
2014-08-30 04:52:15 +04:00
bh = ext4_bread ( NULL , inode , blk , 0 ) ;
if ( IS_ERR ( bh ) )
return PTR_ERR ( bh ) ;
2006-10-11 12:20:50 +04:00
if ( ! bh ) /* A hole? */
memset ( data , 0 , tocopy ) ;
else
memcpy ( data , bh - > b_data + offset , tocopy ) ;
brelse ( bh ) ;
offset = 0 ;
toread - = tocopy ;
data + = tocopy ;
blk + + ;
}
return len ;
}
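The block-by-block copy loop in ext4_quota_read() above can be sketched in user space as follows. This is only an illustration under stated assumptions: DEMO_BLOCK_SIZE, demo_lookup() and chunked_read() are hypothetical names, and the lookup callback stands in for ext4_bread(), returning NULL for a hole.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_BLOCK_SIZE 8	/* hypothetical tiny block size */

/* Hypothetical block source: pointer to block `blk`, or NULL for a hole. */
typedef const char *(*block_lookup_fn)(size_t blk);

/* Demo "file": block 0 and block 2 hold data, block 1 is a hole. */
static const char *demo_lookup(size_t blk)
{
	static const char blocks[3][DEMO_BLOCK_SIZE + 1] = {
		"ABCDEFGH", "", "QRSTUVWX"
	};

	if (blk == 1 || blk >= 3)
		return NULL;
	return blocks[blk];
}

/* Copy up to len bytes from byte offset off, clamped to file_size. */
static size_t chunked_read(block_lookup_fn lookup, size_t file_size,
			   char *data, size_t len, size_t off)
{
	size_t blk = off / DEMO_BLOCK_SIZE;
	size_t offset = off % DEMO_BLOCK_SIZE;
	size_t toread;

	assert(lookup != NULL && data != NULL);
	if (off > file_size)
		return 0;
	if (off + len > file_size)
		len = file_size - off;
	toread = len;
	while (toread > 0) {
		size_t room = DEMO_BLOCK_SIZE - offset;
		size_t tocopy = room < toread ? room : toread;
		const char *block = lookup(blk);

		if (!block)	/* a hole reads back as zeroes */
			memset(data, 0, tocopy);
		else
			memcpy(data, block + offset, tocopy);
		offset = 0;	/* only the first block starts mid-block */
		toread -= tocopy;
		data += tocopy;
		blk++;
	}
	return len;
}
```

Only the first chunk can start in the middle of a block, which is why the in-block offset is reset to zero after the first iteration.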
/* Write to quotafile (we know the transaction is already started and has
* enough credits ) */
2006-10-11 12:20:53 +04:00
static ssize_t ext4_quota_write ( struct super_block * sb , int type ,
2006-10-11 12:20:50 +04:00
const char * data , size_t len , loff_t off )
{
struct inode * inode = sb_dqopt ( sb ) - > files [ type ] ;
2008-01-29 07:58:27 +03:00
ext4_lblk_t blk = off > > EXT4_BLOCK_SIZE_BITS ( sb ) ;
2014-08-30 04:52:15 +04:00
int err , offset = off & ( sb - > s_blocksize - 1 ) ;
2015-06-21 08:25:29 +03:00
int retries = 0 ;
2006-10-11 12:20:50 +04:00
struct buffer_head * bh ;
handle_t * handle = journal_current_handle ( ) ;
2009-01-07 08:06:22 +03:00
if ( EXT4_SB ( sb ) - > s_journal & & ! handle ) {
2009-06-05 01:36:36 +04:00
ext4_msg ( sb , KERN_WARNING , " Quota write (off=%llu, len=%llu) "
" cancelled because transaction is not started " ,
2007-09-12 02:23:29 +04:00
( unsigned long long ) off , ( unsigned long long ) len ) ;
return - EIO ;
}
2010-03-02 16:08:51 +03:00
/*
* Since we account only one data block in transaction credits ,
* a single quota write must not cross a block boundary .
*/
if ( sb - > s_blocksize - offset < len ) {
ext4_msg ( sb , KERN_WARNING , " Quota write (off=%llu, len=%llu) "
" cancelled because not block aligned " ,
( unsigned long long ) off , ( unsigned long long ) len ) ;
return - EIO ;
}
2015-06-21 08:25:29 +03:00
do {
bh = ext4_bread ( handle , inode , blk ,
EXT4_GET_BLOCKS_CREATE |
EXT4_GET_BLOCKS_METADATA_NOFAIL ) ;
} while ( IS_ERR ( bh ) & & ( PTR_ERR ( bh ) = = - ENOSPC ) & &
ext4_should_retry_alloc ( inode - > i_sb , & retries ) ) ;
2014-08-30 04:52:15 +04:00
if ( IS_ERR ( bh ) )
return PTR_ERR ( bh ) ;
2010-03-02 16:08:51 +03:00
if ( ! bh )
goto out ;
2014-05-13 06:06:43 +04:00
BUFFER_TRACE ( bh , " get write access " ) ;
2010-07-27 19:56:07 +04:00
err = ext4_journal_get_write_access ( handle , bh ) ;
if ( err ) {
brelse ( bh ) ;
2014-08-30 04:52:15 +04:00
return err ;
2006-10-11 12:20:50 +04:00
}
2010-03-02 16:08:51 +03:00
lock_buffer ( bh ) ;
memcpy ( bh - > b_data + offset , data , len ) ;
flush_dcache_page ( bh - > b_page ) ;
unlock_buffer ( bh ) ;
2010-07-27 19:56:07 +04:00
err = ext4_handle_dirty_metadata ( handle , NULL , bh ) ;
2010-03-02 16:08:51 +03:00
brelse ( bh ) ;
2006-10-11 12:20:50 +04:00
out :
2010-03-02 16:08:51 +03:00
if ( inode - > i_size < off + len ) {
i_size_write ( inode , off + len ) ;
2006-10-11 12:20:53 +04:00
EXT4_I ( inode ) - > i_disksize = inode - > i_size ;
2011-04-04 23:33:39 +04:00
ext4_mark_inode_dirty ( handle , inode ) ;
2006-10-11 12:20:50 +04:00
}
2010-03-02 16:08:51 +03:00
return len ;
2006-10-11 12:20:50 +04:00
}
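The block-boundary test at the top of ext4_quota_write() above amounts to the following check, shown here as a stand-alone sketch. The helper name fits_in_one_block() is hypothetical, and the sketch assumes the block size is a power of two, as the masking in the original code implies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * True when the byte range [off, off + len) stays inside a single block.
 * block_size is assumed to be a power of two so that masking yields the
 * offset within the block.
 */
static bool fits_in_one_block(size_t block_size, size_t off, size_t len)
{
	size_t offset;

	assert(block_size != 0 && (block_size & (block_size - 1)) == 0);
	offset = off & (block_size - 1);
	return block_size - offset >= len;
}
```

A write that fails this check would need a second data block, which the single-block credit reservation described in the comment does not cover.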
2016-04-01 19:00:03 +03:00
static int ext4_get_next_id ( struct super_block * sb , struct kqid * qid )
{
const struct quota_format_ops * ops ;
if ( ! sb_has_quota_loaded ( sb , qid - > type ) )
return - ESRCH ;
ops = sb_dqopt ( sb ) - > ops [ qid - > type ] ;
if ( ! ops | | ! ops - > get_next_id )
return - ENOSYS ;
return dquot_get_next_id ( sb , qid ) ;
}
2006-10-11 12:20:50 +04:00
# endif
2010-07-25 00:46:55 +04:00
static struct dentry * ext4_mount ( struct file_system_type * fs_type , int flags ,
const char * dev_name , void * data )
2006-10-11 12:20:50 +04:00
{
2010-07-25 00:46:55 +04:00
return mount_bdev ( fs_type , flags , dev_name , data , ext4_fill_super ) ;
2006-10-11 12:20:50 +04:00
}
2015-06-18 17:52:29 +03:00
# if !defined(CONFIG_EXT2_FS) && !defined(CONFIG_EXT2_FS_MODULE) && defined(CONFIG_EXT4_USE_FOR_EXT2)
2009-12-07 22:08:51 +03:00
static inline void register_as_ext2 ( void )
{
int err = register_filesystem ( & ext2_fs_type ) ;
if ( err )
printk ( KERN_WARNING
" EXT4-fs: Unable to register as ext2 (%d) \n " , err ) ;
}
static inline void unregister_as_ext2 ( void )
{
unregister_filesystem ( & ext2_fs_type ) ;
}
2011-04-19 01:29:14 +04:00
static inline int ext2_feature_set_ok ( struct super_block * sb )
{
2015-10-17 23:18:43 +03:00
if ( ext4_has_unknown_ext2_incompat_features ( sb ) )
2011-04-19 01:29:14 +04:00
return 0 ;
if ( sb - > s_flags & MS_RDONLY )
return 1 ;
2015-10-17 23:18:43 +03:00
if ( ext4_has_unknown_ext2_ro_compat_features ( sb ) )
2011-04-19 01:29:14 +04:00
return 0 ;
return 1 ;
}
2009-12-07 22:08:51 +03:00
# else
static inline void register_as_ext2 ( void ) { }
static inline void unregister_as_ext2 ( void ) { }
2011-04-19 01:29:14 +04:00
static inline int ext2_feature_set_ok ( struct super_block * sb ) { return 0 ; }
2009-12-07 22:08:51 +03:00
# endif
static inline void register_as_ext3 ( void )
{
int err = register_filesystem ( & ext3_fs_type ) ;
if ( err )
printk ( KERN_WARNING
" EXT4-fs: Unable to register as ext3 (%d) \n " , err ) ;
}
static inline void unregister_as_ext3 ( void )
{
unregister_filesystem ( & ext3_fs_type ) ;
}
2011-04-19 01:29:14 +04:00
static inline int ext3_feature_set_ok ( struct super_block * sb )
{
2015-10-17 23:18:43 +03:00
if ( ext4_has_unknown_ext3_incompat_features ( sb ) )
2011-04-19 01:29:14 +04:00
return 0 ;
2015-10-17 23:18:43 +03:00
if ( ! ext4_has_feature_journal ( sb ) )
2011-04-19 01:29:14 +04:00
return 0 ;
if ( sb - > s_flags & MS_RDONLY )
return 1 ;
2015-10-17 23:18:43 +03:00
if ( ext4_has_unknown_ext3_ro_compat_features ( sb ) )
2011-04-19 01:29:14 +04:00
return 0 ;
return 1 ;
}
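The ext2/ext3 feature checks above follow one pattern: unknown incompat bits always refuse the mount, while unknown ro_compat bits only refuse a read-write mount. A minimal user-space sketch of that pattern is given below. The struct demo_sb type, its fields and the mask parameters are assumptions made for illustration, not the kernel's data structures.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant superblock state. */
struct demo_sb {
	uint32_t incompat;	/* incompat feature bits on disk */
	uint32_t ro_compat;	/* ro_compat feature bits on disk */
	bool read_only;		/* mount requested read-only */
};

/* Returns true when the driver may handle this filesystem. */
static bool demo_feature_set_ok(const struct demo_sb *sb,
				uint32_t known_incompat,
				uint32_t known_ro_compat)
{
	if (sb->incompat & ~known_incompat)
		return false;	/* unknown incompat bit: never mount */
	if (sb->read_only)
		return true;	/* ro_compat bits do not matter read-only */
	if (sb->ro_compat & ~known_ro_compat)
		return false;	/* unknown ro_compat bit: no rw mount */
	return true;
}
```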
2009-12-07 22:08:51 +03:00
2008-10-11 04:02:48 +04:00
static struct file_system_type ext4_fs_type = {
. owner = THIS_MODULE ,
. name = " ext4 " ,
2010-07-25 00:46:55 +04:00
. mount = ext4_mount ,
2008-10-11 04:02:48 +04:00
. kill_sb = kill_block_super ,
. fs_flags = FS_REQUIRES_DEV ,
} ;
2013-03-03 07:39:14 +04:00
MODULE_ALIAS_FS ( " ext4 " ) ;
2008-10-11 04:02:48 +04:00
ext4: serialize unaligned asynchronous DIO
ext4 has a data corruption case when doing non-block-aligned
asynchronous direct IO into a sparse file, as demonstrated
by xfstest 240.
The root cause is that while ext4 preallocates space in the
hole, mappings of that space still look "new" and
dio_zero_block() will zero out the unwritten portions. When
more than one AIO thread is going, they both find this "new"
block and race to zero out their portion; this is uncoordinated
and causes data corruption.
Dave Chinner fixed this for xfs by simply serializing all
unaligned asynchronous direct IO. I've done the same here.
The difference is that we only wait on conversions, not all IO.
This is a very big hammer, and I'm not very pleased with
stuffing this into ext4_file_write(). But since ext4 is
DIO_LOCKING, we need to serialize it at this high level.
I tried to move this into ext4_ext_direct_IO, but by then
we have the i_mutex already, and we will wait on the
work queue to do conversions - which must also take the
i_mutex. So that won't work.
This was originally exposed by qemu-kvm installing to
a raw disk image with a normal sector-63 alignment. I've
tested a backport of this patch with qemu, and it does
avoid the corruption. It is also quite a lot slower
(14 min for package installs, vs. 8 min for well-aligned)
but I'll take slow correctness over fast corruption any day.
Mingming suggested that we can track outstanding
conversions, and wait on those so that non-sparse
files won't be affected, and I've implemented that here;
unaligned AIO to nonsparse files won't take a perf hit.
[tytso@mit.edu: Keep the mutex as a hashed array instead
of bloating the ext4 inode]
[tytso@mit.edu: Fix up namespace issues so that global
variables are protected with an "ext4_" prefix.]
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
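The "hashed array instead of bloating the ext4 inode" idea mentioned in the message above can be sketched as follows. This is an illustration only: the table size of 37, the struct names and the address-modulo hash are assumptions, and a plain counter stands in for a kernel wait queue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_WQ_HASH_SZ 37	/* assumed table size, illustration only */

struct demo_waitq { unsigned waiters; };	/* stand-in for a wait queue */
struct demo_inode { long ino; };		/* stand-in for an inode */

/* One small table shared by every object instead of one queue per object. */
static struct demo_waitq demo_ioend_wq[DEMO_WQ_HASH_SZ];

/* Map an object to its shared slot by hashing the object's address. */
static struct demo_waitq *demo_wq_for(const struct demo_inode *inode)
{
	assert(inode != NULL);
	return &demo_ioend_wq[(uintptr_t)inode % DEMO_WQ_HASH_SZ];
}
```

The trade-off is that two unrelated objects may hash to the same slot and wake each other spuriously, in exchange for not growing every inode by the size of a wait queue.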
2011-02-12 16:17:34 +03:00
/* Shared across all ext4 file systems */
wait_queue_head_t ext4__ioend_wq [ EXT4_WQ_HASH_SZ ] ;
2010-10-28 05:30:14 +04:00
static int __init ext4_init_fs ( void )
2006-10-11 12:20:50 +04:00
{
2011-02-12 16:17:34 +03:00
int i , err ;
2008-01-29 08:19:52 +03:00
2015-08-15 21:59:44 +03:00
ratelimit_state_init ( & ext4_mount_msg_ratelimit , 30 * HZ , 64 ) ;
2012-03-21 06:05:02 +04:00
ext4_li_info = NULL ;
mutex_init ( & ext4_li_mtx ) ;
ext4: ensure Inode flags consistency are checked at build time
Flags being used by atomic operations in inode flags (e.g.
ext4_test_inode_flag()) should be consistent with those actually stored
in inodes, i.e. EXT4_XXX_FL.
This patch ensures that this consistency is checked at build time, not at
run time.
Currently, flag consistency is checked at run time, but
there is no real reason not to do a build-time check instead of a
run-time check. The code compares macro-defined values with enumeration
constants; both are constants, so there is no problem in
comparing them at build time.
Enumeration constants are treated as constants by the C compiler, according
to the C99 specs (see www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf
sec. 6.2.5, item 16), so there is no real problem in comparing an
enumeration type at build time.
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
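A build-time consistency check of the kind the message above describes can be sketched with C11 _Static_assert. The DEMO_ names and the specific values below are illustrative assumptions, not the kernel's definitions, and the kernel itself may use a different mechanism for the check.

```c
/* Flag masks as they would be stored in an inode (illustrative values). */
#define DEMO_IMMUTABLE_FL 0x00000010u
#define DEMO_NOATIME_FL   0x00000080u

/* Bit numbers as used by atomic bit operations (illustrative values). */
enum {
	DEMO_INODE_IMMUTABLE = 4,
	DEMO_INODE_NOATIME = 7,
};

/* Fails the build if a mask and its bit number ever disagree. */
#define DEMO_CHECK_FLAG(FLAG) \
	_Static_assert(DEMO_##FLAG##_FL == (1u << DEMO_INODE_##FLAG), \
		       "mask and bit number disagree for " #FLAG)

DEMO_CHECK_FLAG(IMMUTABLE);
DEMO_CHECK_FLAG(NOATIME);

/* Run-time helper used only to demonstrate the same relation. */
static unsigned demo_flag_mask(int bit)
{
	return 1u << bit;
}
```

Because both operands are integer constant expressions, a mismatch is reported by the compiler and costs nothing at run time.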
2012-12-11 01:30:45 +04:00
/* Build-time check for flags consistency */
2010-05-17 06:00:00 +04:00
ext4_check_flag_values ( ) ;
2011-02-12 16:17:34 +03:00
2016-03-09 06:44:50 +03:00
for ( i = 0 ; i < EXT4_WQ_HASH_SZ ; i + + )
2011-02-12 16:17:34 +03:00
init_waitqueue_head ( & ext4__ioend_wq [ i ] ) ;
2012-11-09 06:57:32 +04:00
err = ext4_init_es ( ) ;
2009-05-17 23:38:01 +04:00
if ( err )
return err ;
2012-11-09 06:57:32 +04:00
err = ext4_init_pageio ( ) ;
if ( err )
2015-09-23 19:44:17 +03:00
goto out5 ;
2012-11-09 06:57:32 +04:00
2010-10-28 05:30:14 +04:00
err = ext4_init_system_zone ( ) ;
2010-10-28 05:30:10 +04:00
if ( err )
2015-09-23 19:44:17 +03:00
goto out4 ;
2010-10-28 05:30:05 +04:00
2015-09-23 19:44:17 +03:00
err = ext4_init_sysfs ( ) ;
2011-02-03 22:33:49 +03:00
if ( err )
2015-09-23 19:44:17 +03:00
goto out3 ;
2010-10-28 05:30:05 +04:00
2010-10-28 05:30:14 +04:00
err = ext4_init_mballoc ( ) ;
2008-01-29 08:19:52 +03:00
if ( err )
goto out2 ;
2006-10-11 12:20:50 +04:00
err = init_inodecache ( ) ;
if ( err )
goto out1 ;
2009-12-07 22:08:51 +03:00
register_as_ext3 ( ) ;
2011-04-19 01:29:14 +04:00
register_as_ext2 ( ) ;
2008-10-11 04:02:48 +04:00
err = register_filesystem ( & ext4_fs_type ) ;
2006-10-11 12:20:50 +04:00
if ( err )
goto out ;
ext4: add support for lazy inode table initialization
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out. The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, have not been corrupted. However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the old, uninitialized data could potentially cause e2fsck to
report false problems.
Hence, it is important for the inode tables to be initialized as soon
as possible. This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.
This is done via a new kernel thread called ext4lazyinit, which is
created on demand and destroyed when it is no longer needed. There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with the init_itable mount option is mounted, the ext4lazyinit
thread is created, then the filesystem can register its request in the
request list.
This thread then walks through the list of requests picking up
scheduled requests and invoking ext4_init_inode_table(). Next schedule
time for the request is computed by multiplying the time it took to
zero out last inode table with wait multiplier, which can be set with
the (init_itable=n) mount option (default is 10). We do
this so that we do not take the whole I/O bandwidth. When the thread is no
longer necessary (request list is empty) it frees the appropriate
structures and exits (and can be created later by another
filesystem).
We do not disturb regular inode allocations in any way; the allocator does not
care whether the inode table is zeroed or not. But when zeroing, we
have to skip used inodes, obviously. We should also prevent new inode
allocations from the group while zeroing is under way. For that we
take the write alloc_sem lock in ext4_init_inode_table() and the read alloc_sem
in ext4_claim_inode(), so when we are unlucky and the allocator hits the
group which is currently being zeroed, it just has to wait.
This can be suppressed using the mount option no_init_itable.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
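The scheduling rule described in the message above (next run time is the time the last zeroing took, multiplied by the wait multiplier) can be sketched as below. The function name, the time unit and the DEMO_ constant are assumptions for illustration; only the default multiplier of 10 comes from the message.

```c
#include <stdint.h>

#define DEMO_DEFAULT_WAIT_MULT 10	/* default stated in the message */

/*
 * Given the current time and how long the last inode table took to zero
 * (both in the same arbitrary tick unit), return when the next request
 * should run. A larger multiplier leaves more I/O bandwidth to others.
 */
static uint64_t demo_next_schedule(uint64_t now, uint64_t zeroing_time,
				   unsigned wait_mult)
{
	return now + zeroing_time * wait_mult;
}
```

With the default multiplier, a table that took 5 ticks to zero delays the next request by 50 ticks, so zeroing occupies roughly one eleventh of the elapsed time.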
2010-10-28 05:30:05 +04:00
2006-10-11 12:20:50 +04:00
return 0 ;
out :
2009-12-07 22:08:51 +03:00
unregister_as_ext2 ( ) ;
unregister_as_ext3 ( ) ;
2006-10-11 12:20:50 +04:00
destroy_inodecache ( ) ;
out1 :
2010-10-28 05:30:14 +04:00
ext4_exit_mballoc ( ) ;
2014-03-19 03:24:49 +04:00
out2 :
2015-09-23 19:44:17 +03:00
ext4_exit_sysfs ( ) ;
out3 :
2010-10-28 05:30:14 +04:00
ext4_exit_system_zone ( ) ;
2015-09-23 19:44:17 +03:00
out4 :
2010-10-28 05:30:14 +04:00
ext4_exit_pageio ( ) ;
2015-09-23 19:44:17 +03:00
out5 :
2012-11-09 06:57:32 +04:00
ext4_exit_es ( ) ;
2006-10-11 12:20:50 +04:00
return err ;
}
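ext4_init_fs() above uses the common goto-unwind ladder: each failing step jumps to a label that tears down only what was already set up, in reverse order. A reduced user-space sketch of the pattern follows. All demo_ names are hypothetical, and events are recorded as integers (+k for init of step k, -k for its teardown) purely so the ordering can be inspected.

```c
#include <stddef.h>

#define DEMO_MAX_EVENTS 8

static int demo_events[DEMO_MAX_EVENTS];
static size_t demo_nevents;

static void demo_record(int event)
{
	if (demo_nevents < DEMO_MAX_EVENTS)
		demo_events[demo_nevents++] = event;
}

/* Pretend to initialise step k; fail without side effects when asked to. */
static int demo_init_step(int k, int fail)
{
	if (fail)
		return -1;
	demo_record(k);
	return 0;
}

/* fail_at selects which step (1..3) fails; 0 means every step succeeds. */
static int demo_init_all(int fail_at)
{
	int err;

	demo_nevents = 0;
	err = demo_init_step(1, fail_at == 1);
	if (err)
		return err;	/* nothing to undo yet */
	err = demo_init_step(2, fail_at == 2);
	if (err)
		goto out_1;
	err = demo_init_step(3, fail_at == 3);
	if (err)
		goto out_2;
	return 0;
out_2:
	demo_record(-2);	/* undo step 2 */
out_1:
	demo_record(-1);	/* undo step 1 */
	return err;
}
```

Labels fall through from the most recent step to the oldest, so adding a new step means adding one label at the top of the ladder.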
2010-10-28 05:30:14 +04:00
static void __exit ext4_exit_fs ( void )
2006-10-11 12:20:50 +04:00
{
2010-10-28 05:30:05 +04:00
ext4_destroy_lazyinit_thread ( ) ;
2009-12-07 22:08:51 +03:00
unregister_as_ext2 ( ) ;
unregister_as_ext3 ( ) ;
2008-10-11 04:02:48 +04:00
unregister_filesystem ( & ext4_fs_type ) ;
2006-10-11 12:20:50 +04:00
destroy_inodecache ( ) ;
2010-10-28 05:30:14 +04:00
ext4_exit_mballoc ( ) ;
2015-09-23 19:44:17 +03:00
ext4_exit_sysfs ( ) ;
2010-10-28 05:30:14 +04:00
ext4_exit_system_zone ( ) ;
ext4_exit_pageio ( ) ;
2013-07-26 23:21:11 +04:00
ext4_exit_es ( ) ;
2006-10-11 12:20:50 +04:00
}
MODULE_AUTHOR ( " Remy Card, Stephen Tweedie, Andrew Morton, Andreas Dilger, Theodore Ts'o and others " ) ;
2009-01-06 22:53:16 +03:00
MODULE_DESCRIPTION ( " Fourth Extended Filesystem " ) ;
2006-10-11 12:20:50 +04:00
MODULE_LICENSE ( " GPL " ) ;
2010-10-28 05:30:14 +04:00
module_init ( ext4_init_fs )
module_exit ( ext4_exit_fs )