2006-10-11 01:20:50 -07:00
/*
2006-10-11 01:20:53 -07:00
* linux/fs/ext4/ialloc.c
2006-10-11 01:20:50 -07:00
*
* Copyright (C) 1992, 1993, 1994, 1995
* Remy Card (card@masi.ibp.fr)
* Laboratoire MASI - Institut Blaise Pascal
* Universite Pierre et Marie Curie (Paris VI)
*
* BSD ufs-inspired inode and directory allocation by
* Stephen Tweedie (sct@redhat.com), 1993
* Big-endian to little-endian byte-swapping/bitmaps by
* David S. Miller (davem@caip.rutgers.edu), 1995
*/
#include <linux/time.h>
#include <linux/fs.h>
#include <linux/stat.h>
#include <linux/string.h>
#include <linux/quotaops.h>
#include <linux/buffer_head.h>
#include <linux/random.h>
#include <linux/bitops.h>
2006-10-11 01:21:05 -07:00
#include <linux/blkdev.h>
2017-02-02 17:54:15 +01:00
#include <linux/cred.h>
2006-10-11 01:20:50 -07:00
#include <asm/byteorder.h>
2009-06-17 11:48:11 -04:00
2008-04-29 18:13:32 -04:00
#include "ext4.h"
#include "ext4_jbd2.h"
2006-10-11 01:20:50 -07:00
#include "xattr.h"
#include "acl.h"
2009-06-17 11:48:11 -04:00
#include <trace/events/ext4.h>
2006-10-11 01:20:50 -07:00
/*
* ialloc.c contains the inode allocation and deallocation routines
*/
/*
* The free inodes are managed by bitmaps. A file system contains several
* block groups. Each group contains 1 bitmap block for blocks, 1 bitmap
* block for inodes, N blocks for the inode table and data blocks.
*
* The file system contains group descriptors which are located after the
* super block. Each descriptor contains the number of the bitmap block and
* the free blocks count in the group.
*/
Ext4: Uninitialized Block Groups
In pass1 of e2fsck, every inode table in the filesystem is scanned and checked,
regardless of whether it is in use. This is the most time-consuming part
of the filesystem check. The uninitialized block group feature can greatly
reduce e2fsck time by eliminating checking of uninitialized inodes.
With this feature, there is a high water mark of used inodes for each block
group. Block and inode bitmaps can be uninitialized on disk via a flag in the
group descriptor to avoid reading or scanning them at e2fsck time. A checksum
of each group descriptor is used to ensure that corruption in the group
descriptor's bit flags does not cause incorrect operation.
The feature is enabled through a mkfs option:
mke2fs /dev/ -O uninit_groups
A patch adding support for uninitialized block groups to e2fsprogs tools has
been posted to the linux-ext4 mailing list.
The patches have been stress tested with fsstress and fsx. In performance
tests of e2fsck time, we have seen that e2fsck time on ext3 grows
linearly with the total number of inodes in the filesystem. In ext4 with the
uninitialized block groups feature, the e2fsck time stays constant as the
total inode count grows, because it depends solely on the number of used
inodes.
Since typical ext4 filesystems only use 1-10% of their inodes, this feature can
greatly reduce e2fsck time for users, with a performance improvement of 2-20
times, depending on how full the filesystem is.
The attached graph shows the major improvements in e2fsck times in filesystems
with a large total inode count, but few inodes in use.
In each group descriptor, if we have
EXT4_BG_INODE_UNINIT set in bg_flags:
The inode table is not initialized/used in this group, so we can skip
the consistency check during fsck.
EXT4_BG_BLOCK_UNINIT set in bg_flags:
No block in the group is used, so we can skip the block bitmap
verification for this group.
We also add two new fields to the group descriptor as part of the
uninitialized group patch:
__le16 bg_itable_unused; /* Unused inodes count */
__le16 bg_checksum; /* crc16(sb_uuid+group+desc) */
bg_itable_unused:
If EXT4_BG_INODE_UNINIT is not set in bg_flags,
bg_itable_unused gives the offset within the inode table up to
which inodes are in use. fsck can use this to skip the inodes
that are marked unused.
bg_checksum:
Now that we depend on bg_flags and bg_itable_unused to determine
the block and inode usage, we need to make sure the group descriptor
is not corrupt. We add a checksum to the group descriptor to
detect corruption. If the descriptor is found to be corrupt, we
mark all the blocks and inodes in the group used.
Signed-off-by: Avantika Mathur <mathur@us.ibm.com>
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
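The decision described in the commit message above can be sketched in userspace C. This is only an illustration under stated assumptions, not e2fsck's or the kernel's code: `struct demo_group_desc` and `demo_inodes_to_scan()` are hypothetical names, the descriptor is reduced to two host-endian fields, the checksum result is passed in rather than computed, and `bg_itable_unused` is read as a count of unused inodes at the end of the table (following the field comment). The flag values are believed to match the kernel's ext4.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values believed to match the kernel's ext4.h definitions. */
#define EXT4_BG_INODE_UNINIT 0x0001 /* inode table/bitmap not in use */
#define EXT4_BG_BLOCK_UNINIT 0x0002 /* block bitmap not in use */

/* Hypothetical, simplified group descriptor (host byte order). */
struct demo_group_desc {
	uint16_t bg_flags;
	uint16_t bg_itable_unused; /* assumed: unused inodes at end of table */
};

/*
 * Number of inode table slots a checker would still have to scan.
 * checksum_ok stands in for a real crc16 verification of the descriptor.
 */
static uint32_t demo_inodes_to_scan(const struct demo_group_desc *gd,
				    uint32_t inodes_per_group,
				    int checksum_ok)
{
	assert(gd != NULL);
	if (!checksum_ok)
		return inodes_per_group; /* descriptor untrusted: scan all */
	if (gd->bg_flags & EXT4_BG_INODE_UNINIT)
		return 0;                /* table never used: skip group */
	if (gd->bg_itable_unused > inodes_per_group)
		return inodes_per_group; /* implausible value: scan all */
	return inodes_per_group - gd->bg_itable_unused;
}
```

With 8192 inodes per group and 8000 of them unused, a checker following this sketch would scan 192 slots; with a bad checksum it falls back to all 8192, which mirrors the "mark everything used" policy in the message.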
2007-10-16 18:38:25 -04:00
/*
* To avoid calling the atomic setbit hundreds or thousands of times, we only
* need to use it within a single byte (to ensure we get endianness right).
* We can use memset for the rest of the bitmap as there are no other users.
*/
2010-10-27 21:30:15 -04:00
void ext4_mark_bitmap_end(int start_bit, int end_bit, char *bitmap)
2007-10-16 18:38:25 -04:00
{
	int i;

	if (start_bit >= end_bit)
		return;

	ext4_debug("mark end bits +%d through +%d used\n", start_bit, end_bit);
	for (i = start_bit; i < ((start_bit + 7) & ~7UL); i++)
		ext4_set_bit(i, bitmap);
	if (i < end_bit)
		memset(bitmap + (i >> 3), 0xff, (end_bit - i) >> 3);
}
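For readers who want to try the padding technique outside the kernel, here is a minimal userspace sketch of the same idea. `demo_mark_bitmap_end()` is a hypothetical stand-in: it replaces `ext4_set_bit()` with a plain little-endian bit set (no atomicity) and, like the function above, assumes `end_bit` is a multiple of 8, as it is when the bitmap fills a whole block.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Mark bits [start_bit, end_bit) as used. Bits are set one at a time only
 * until the next byte boundary; whole bytes after that are filled by memset.
 */
static void demo_mark_bitmap_end(int start_bit, int end_bit,
				 unsigned char *bitmap)
{
	int i;

	assert(bitmap != NULL);
	if (start_bit >= end_bit)
		return;

	/* Partial leading byte: little-endian bit order within each byte. */
	for (i = start_bit; i < ((start_bit + 7) & ~7); i++)
		bitmap[i >> 3] |= (unsigned char)(1u << (i & 7));

	/* Remaining whole bytes. */
	if (i < end_bit)
		memset(bitmap + (i >> 3), 0xff, (size_t)(end_bit - i) >> 3);
}
```

Calling it with `start_bit = 10` and `end_bit = 32` on a zeroed 4-byte buffer leaves byte 0 untouched, sets the upper six bits of byte 1, and fills bytes 2 and 3 with 0xff.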
/* Initializes an uninitialized inode bitmap */
2015-10-17 21:33:24 -04:00
static int ext4_init_inode_bitmap(struct super_block *sb,
2010-10-27 21:30:14 -04:00
				  struct buffer_head *bh,
				  ext4_group_t block_group,
				  struct ext4_group_desc *gdp)
2007-10-16 18:38:25 -04:00
{
2013-08-28 18:46:56 -04:00
	struct ext4_group_info *grp;
2014-06-26 10:11:53 -04:00
	struct ext4_sb_info *sbi = EXT4_SB(sb);
2007-10-16 18:38:25 -04:00
	J_ASSERT_BH(bh, buffer_locked(bh));
	/* If checksum is bad mark all blocks and inodes in use to prevent
	 * allocation, essentially implementing a per-group read-only flag. */
2012-04-29 18:45:10 -04:00
	if (!ext4_group_desc_csum_verify(sb, block_group, gdp)) {
2013-08-28 18:46:56 -04:00
		grp = ext4_get_group_info(sb, block_group);
2014-06-26 10:11:53 -04:00
		if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp))
			percpu_counter_sub(&sbi->s_freeclusters_counter,
					   grp->bb_free);
2013-08-28 18:46:56 -04:00
		set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state);
2014-06-26 10:11:53 -04:00
		if (!EXT4_MB_GRP_IBITMAP_CORRUPT(grp)) {
			int count;
			count = ext4_free_inodes_count(sb, gdp);
			percpu_counter_sub(&sbi->s_freeinodes_counter,
					   count);
		}
2013-08-28 18:46:56 -04:00
		set_bit(EXT4_GROUP_INFO_IBITMAP_CORRUPT_BIT, &grp->bb_state);
2015-10-17 21:33:24 -04:00
		return -EFSBADCRC;
2007-10-16 18:38:25 -04:00
}
	memset(bh->b_data, 0, (EXT4_INODES_PER_GROUP(sb) + 7) / 8);
2010-10-27 21:30:15 -04:00
	ext4_mark_bitmap_end(EXT4_INODES_PER_GROUP(sb), sb->s_blocksize * 8,
2007-10-16 18:38:25 -04:00
			     bh->b_data);
2012-04-29 18:33:10 -04:00
	ext4_inode_bitmap_csum_set(sb, block_group, gdp, bh,
				   EXT4_INODES_PER_GROUP(sb) / 8);
2012-04-29 18:45:10 -04:00
	ext4_group_desc_csum_set(sb, block_group, gdp);
2007-10-16 18:38:25 -04:00
2015-10-17 21:33:24 -04:00
	return 0;
2007-10-16 18:38:25 -04:00
}
2006-10-11 01:20:50 -07:00
2012-02-20 17:52:46 -05:00
void ext4_end_bitmap_read(struct buffer_head *bh, int uptodate)
{
	if (uptodate) {
		set_buffer_uptodate(bh);
		set_bitmap_uptodate(bh);
	}
	unlock_buffer(bh);
	put_bh(bh);
}
2015-10-17 21:33:24 -04:00
static int ext4_validate_inode_bitmap(struct super_block *sb,
				      struct ext4_group_desc *desc,
				      ext4_group_t block_group,
				      struct buffer_head *bh)
{
	ext4_fsblk_t blk;
	struct ext4_group_info *grp = ext4_get_group_info(sb, block_group);
	struct ext4_sb_info *sbi = EXT4_SB(sb);

	if (buffer_verified(bh))
		return 0;
	if (EXT4_MB_GRP_IBITMAP_CORRUPT(grp))
		return -EFSCORRUPTED;

	ext4_lock_group(sb, block_group);
	blk = ext4_inode_bitmap(sb, desc);
	if (!ext4_inode_bitmap_csum_verify(sb, block_group, desc, bh,
					   EXT4_INODES_PER_GROUP(sb) / 8)) {
		ext4_unlock_group(sb, block_group);
		ext4_error(sb, "Corrupt inode bitmap - block_group = %u, "
			   "inode_bitmap = %llu", block_group, blk);
		grp = ext4_get_group_info(sb, block_group);
		if (!EXT4_MB_GRP_IBITMAP_CORRUPT(grp)) {
			int count;
			count = ext4_free_inodes_count(sb, desc);
			percpu_counter_sub(&sbi->s_freeinodes_counter,
					   count);
		}
		set_bit(EXT4_GROUP_INFO_IBITMAP_CORRUPT_BIT, &grp->bb_state);
		return -EFSBADCRC;
	}
	set_buffer_verified(bh);
	ext4_unlock_group(sb, block_group);
	return 0;
}
2006-10-11 01:20:50 -07:00
/*
* Read the inode allocation bitmap for a given block_group, reading
* into the specified slot in the superblock's bitmap cache.
*
* Return buffer_head of bitmap on success or an ERR_PTR-encoded error.
*/
static struct buffer_head *
2008-08-02 21:21:02 -04:00
ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
2006-10-11 01:20:50 -07:00
{
2006-10-11 01:20:53 -07:00
	struct ext4_group_desc *desc;
2006-10-11 01:20:50 -07:00
	struct buffer_head *bh = NULL;
2008-08-02 21:21:02 -04:00
	ext4_fsblk_t bitmap_blk;
2015-10-17 21:33:24 -04:00
	int err;
2006-10-11 01:20:50 -07:00
2006-10-11 01:20:53 -07:00
	desc = ext4_get_group_desc(sb, block_group, NULL);
2006-10-11 01:20:50 -07:00
	if (!desc)
2015-10-17 21:33:24 -04:00
		return ERR_PTR(-EFSCORRUPTED);
ext4: add support for lazy inode table initialization
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out. The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, have not been corrupted. However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the old, uninitialized data could potentially cause e2fsck to
report false problems.
Hence, it is important for the inode tables to be initialized as soon
as possible. This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.
This is done via a new kernel thread called ext4lazyinit, which is
created on demand and destroyed when it is no longer needed. There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with the init_itable mount option is mounted, the
ext4lazyinit thread is created, then the filesystem can register its
request in the request list.
This thread then walks through the list of requests picking up
scheduled requests and invoking ext4_init_inode_table(). The next schedule
time for the request is computed by multiplying the time it took to
zero out the last inode table by the wait multiplier, which can be set with
the (init_itable=n) mount option (default is 10). We are doing
this so we do not take the whole I/O bandwidth. When the thread is no
longer necessary (request list is empty) it frees the appropriate
structures and exits (and can be created later by another
filesystem).
We do not disturb regular inode allocations in any way; they simply do not
care whether the inode table is, or is not, zeroed. But when zeroing, we
have to skip used inodes, obviously. Also we should prevent new inode
allocations from the group while zeroing is under way. For that we
take the write alloc_sem lock in ext4_init_inode_table() and the read
alloc_sem in ext4_claim_inode(), so when we are unlucky and the allocator
hits the group which is currently being zeroed, it just has to wait.
This can be suppressed using the mount option no_init_itable.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
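The throttling rule in the message above (next run = time taken by the last zeroing pass, multiplied by the wait multiplier) can be sketched as plain arithmetic. This is an illustrative sketch, not the kernel's scheduler code: `demo_next_schedule()` and `DEMO_DEFAULT_WAIT_MULT` are hypothetical names, and time is an abstract tick counter rather than jiffies.

```c
#include <assert.h>
#include <stdint.h>

/* Default multiplier as stated in the commit message. */
#define DEMO_DEFAULT_WAIT_MULT 10u

/*
 * Compute when the next inode table zeroing request should run.
 * now:       current time in ticks
 * elapsed:   ticks the previous zeroing pass took
 * wait_mult: multiplier (the init_itable=n value in the message)
 */
static uint64_t demo_next_schedule(uint64_t now, uint64_t elapsed,
				   unsigned int wait_mult)
{
	/* Guard the multiplication against overflow in this sketch. */
	assert(wait_mult == 0 || elapsed <= UINT64_MAX / wait_mult);
	return now + elapsed * wait_mult;
}
```

If a pass took 25 ticks and finished at tick 1000, the default multiplier of 10 puts the next pass at tick 1250, so zeroing occupies roughly one eleventh of the elapsed time and leaves the rest of the I/O bandwidth to other work.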
2010-10-27 21:30:05 -04:00
2008-08-02 21:21:02 -04:00
	bitmap_blk = ext4_inode_bitmap(sb, desc);
	bh = sb_getblk(sb, bitmap_blk);
	if (unlikely(!bh)) {
2010-02-15 14:19:27 -05:00
		ext4_error(sb, "Cannot read inode bitmap - "
2009-01-05 22:18:16 -05:00
			   "block_group = %u, inode_bitmap = %llu",
2008-08-02 21:21:02 -04:00
			   block_group, bitmap_blk);
2015-10-17 21:33:24 -04:00
		return ERR_PTR(-EIO);
2008-08-02 21:21:02 -04:00
	}
2009-01-05 21:49:55 -05:00
	if (bitmap_uptodate(bh))
2012-04-29 18:33:10 -04:00
		goto verify;
2008-08-02 21:21:02 -04:00
ext4: fix initialization of UNINIT bitmap blocks
This fixes a bug which caused on-line resizing of filesystems with a
1k blocksize to fail. The root cause of this bug was the fact that if
an uninitialized bitmap block gets read in by userspace (which
e2fsprogs does try to avoid, but can happen when the blocksize is less
than the pagesize and an adjacent block is read into memory)
ext4_read_block_bitmap() was erroneously depending on the buffer
uptodate flag to decide whether it needed to initialize the bitmap
block in memory --- i.e., to set the standard set of blocks in use by
a block group (superblock, bitmaps, inode table, etc.). Essentially,
ext4_read_block_bitmap() assumed it was the only routine that might
try to read a block containing a block bitmap, which is simply not
true.
To fix this, ext4_read_block_bitmap() and ext4_read_inode_bitmap()
must always initialize uninitialized bitmap blocks. Once a block or
inode is allocated out of that bitmap, it will be marked as
initialized in the block group descriptor, so in general this won't
result in any extra unnecessary work.
Signed-off-by: Frederic Bohe <frederic.bohe@bull.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-10 08:09:18 -04:00
lock_buffer ( bh ) ;
2009-01-05 21:49:55 -05:00
if ( bitmap_uptodate ( bh ) ) {
unlock_buffer ( bh ) ;
2012-04-29 18:33:10 -04:00
goto verify ;
2009-01-05 21:49:55 -05:00
}
2010-10-27 21:30:05 -04:00
2009-05-02 20:35:09 -04:00
ext4_lock_group ( sb , block_group ) ;
2007-10-16 18:38:25 -04:00
if ( desc - > bg_flags & cpu_to_le16 ( EXT4_BG_INODE_UNINIT ) ) {
2015-10-17 21:33:24 -04:00
err = ext4_init_inode_bitmap ( sb , bh , block_group , desc ) ;
2009-01-05 21:49:55 -05:00
set_bitmap_uptodate ( bh ) ;
2008-08-02 21:21:02 -04:00
set_buffer_uptodate ( bh ) ;
2012-04-29 18:33:10 -04:00
set_buffer_verified ( bh ) ;
2009-05-02 20:35:09 -04:00
ext4_unlock_group ( sb , block_group ) ;
2009-01-03 22:33:39 -05:00
unlock_buffer ( bh ) ;
2016-02-11 23:15:12 -05:00
if ( err ) {
ext4_error ( sb , " Failed to init inode bitmap for group "
" %u: %d " , block_group , err ) ;
2015-10-17 21:33:24 -04:00
goto out ;
2016-02-11 23:15:12 -05:00
}
2008-08-02 21:21:02 -04:00
return bh ;
2007-10-16 18:38:25 -04:00
}
2009-05-02 20:35:09 -04:00
ext4_unlock_group ( sb , block_group ) ;
2010-10-27 21:30:05 -04:00
2009-01-05 21:49:55 -05:00
if ( buffer_uptodate ( bh ) ) {
/*
* If the group is not uninit and bh is uptodate ,
* the bitmap is also uptodate
*/
set_bitmap_uptodate ( bh ) ;
unlock_buffer ( bh ) ;
2012-04-29 18:33:10 -04:00
goto verify ;
2009-01-05 21:49:55 -05:00
}
/*
2012-02-20 17:52:46 -05:00
* submit the buffer_head for reading
2009-01-05 21:49:55 -05:00
*/
2011-03-21 21:38:05 -04:00
trace_ext4_load_inode_bitmap ( sb , block_group ) ;
2012-02-20 17:52:46 -05:00
bh - > b_end_io = ext4_end_bitmap_read ;
get_bh ( bh ) ;
2016-06-05 14:31:43 -05:00
submit_bh ( REQ_OP_READ , REQ_META | REQ_PRIO , bh ) ;
2012-02-20 17:52:46 -05:00
wait_on_buffer ( bh ) ;
if ( ! buffer_uptodate ( bh ) ) {
2008-08-02 21:21:02 -04:00
put_bh ( bh ) ;
2010-02-15 14:19:27 -05:00
ext4_error ( sb , " Cannot read inode bitmap - "
2012-02-20 17:52:46 -05:00
" block_group = %u, inode_bitmap = %llu " ,
block_group , bitmap_blk ) ;
2015-10-17 21:33:24 -04:00
return ERR_PTR ( - EIO ) ;
2008-08-02 21:21:02 -04:00
}
2012-04-29 18:33:10 -04:00
verify :
2015-10-17 21:33:24 -04:00
err = ext4_validate_inode_bitmap ( sb , desc , block_group , bh ) ;
if ( err )
goto out ;
2006-10-11 01:20:50 -07:00
return bh ;
2015-10-17 21:33:24 -04:00
out :
put_bh ( bh ) ;
return ERR_PTR ( err ) ;
2006-10-11 01:20:50 -07:00
}
/*
* NOTE ! When we get the inode , we ' re the only people
* that have access to it , and as such there are no
* race conditions we have to worry about . The inode
* is not on the hash - lists , and it cannot be reached
* through the filesystem because the directory entry
* has been deleted earlier .
*
* HOWEVER : we must make sure that we get no aliases ,
* which means that we have to call " clear_inode() "
* _before_ we mark the inode not in use in the inode
* bitmaps . Otherwise a newly created file might use
* the same inode number ( not actually the same pointer
* though ) , and then we ' d have two inodes sharing the
* same inode number and space on the harddisk .
*/
2008-09-08 22:25:24 -04:00
void ext4_free_inode ( handle_t * handle , struct inode * inode )
2006-10-11 01:20:50 -07:00
{
2008-09-08 22:25:24 -04:00
struct super_block * sb = inode - > i_sb ;
2006-10-11 01:20:50 -07:00
int is_directory ;
unsigned long ino ;
struct buffer_head * bitmap_bh = NULL ;
struct buffer_head * bh2 ;
2008-01-28 23:58:27 -05:00
ext4_group_t block_group ;
2006-10-11 01:20:50 -07:00
unsigned long bit ;
2008-09-08 22:25:24 -04:00
struct ext4_group_desc * gdp ;
struct ext4_super_block * es ;
2006-10-11 01:20:53 -07:00
struct ext4_sb_info * sbi ;
2009-03-04 18:38:18 -05:00
int fatal = 0 , err , count , cleared ;
2013-08-28 18:32:58 -04:00
struct ext4_group_info * grp ;
2006-10-11 01:20:50 -07:00
2012-03-19 23:41:49 -04:00
if ( ! sb ) {
printk ( KERN_ERR " EXT4-fs: %s:%d: inode on "
" nonexistent device \n " , __func__ , __LINE__ ) ;
2006-10-11 01:20:50 -07:00
return ;
}
2012-03-19 23:41:49 -04:00
if ( atomic_read ( & inode - > i_count ) > 1 ) {
ext4_msg ( sb , KERN_ERR , " %s:%d: inode #%lu: count=%d " ,
__func__ , __LINE__ , inode - > i_ino ,
atomic_read ( & inode - > i_count ) ) ;
2006-10-11 01:20:50 -07:00
return ;
}
2012-03-19 23:41:49 -04:00
if ( inode - > i_nlink ) {
ext4_msg ( sb , KERN_ERR , " %s:%d: inode #%lu: nlink=%d \n " ,
__func__ , __LINE__ , inode - > i_ino , inode - > i_nlink ) ;
2006-10-11 01:20:50 -07:00
return ;
}
2006-10-11 01:20:53 -07:00
sbi = EXT4_SB ( sb ) ;
2006-10-11 01:20:50 -07:00
ino = inode - > i_ino ;
2008-09-08 22:25:24 -04:00
ext4_debug ( " freeing inode %lu \n " , ino ) ;
2009-06-17 11:48:11 -04:00
trace_ext4_free_inode ( inode ) ;
2006-10-11 01:20:50 -07:00
/*
* Note : we must free any quota before locking the superblock ,
* as writing the quota to disk may need the lock as well .
*/
2010-03-03 09:05:07 -05:00
dquot_initialize ( inode ) ;
2010-03-03 09:05:01 -05:00
dquot_free_inode ( inode ) ;
2010-03-03 09:05:05 -05:00
dquot_drop ( inode ) ;
2006-10-11 01:20:50 -07:00
is_directory = S_ISDIR ( inode - > i_mode ) ;
/* Do this BEFORE marking the inode not in use or returning an error */
2010-06-07 13:16:22 -04:00
ext4_clear_inode ( inode ) ;
2006-10-11 01:20:50 -07:00
2006-10-11 01:20:53 -07:00
es = EXT4_SB ( sb ) - > s_es ;
if ( ino < EXT4_FIRST_INO ( sb ) | | ino > le32_to_cpu ( es - > s_inodes_count ) ) {
2010-02-15 14:19:27 -05:00
ext4_error ( sb , " reserved or nonexistent inode %lu " , ino ) ;
2006-10-11 01:20:50 -07:00
goto error_return ;
}
2006-10-11 01:20:53 -07:00
block_group = ( ino - 1 ) / EXT4_INODES_PER_GROUP ( sb ) ;
bit = ( ino - 1 ) % EXT4_INODES_PER_GROUP ( sb ) ;
2008-08-02 21:21:02 -04:00
bitmap_bh = ext4_read_inode_bitmap ( sb , block_group ) ;
2013-08-28 18:32:58 -04:00
/* Don't bother if the inode bitmap is corrupt. */
grp = ext4_get_group_info ( sb , block_group ) ;
2015-10-17 21:33:24 -04:00
if ( IS_ERR ( bitmap_bh ) ) {
fatal = PTR_ERR ( bitmap_bh ) ;
bitmap_bh = NULL ;
goto error_return ;
}
if ( unlikely ( EXT4_MB_GRP_IBITMAP_CORRUPT ( grp ) ) ) {
fatal = - EFSCORRUPTED ;
2006-10-11 01:20:50 -07:00
goto error_return ;
2015-10-17 21:33:24 -04:00
}
2006-10-11 01:20:50 -07:00
BUFFER_TRACE ( bitmap_bh , " get_write_access " ) ;
2006-10-11 01:20:53 -07:00
fatal = ext4_journal_get_write_access ( handle , bitmap_bh ) ;
2006-10-11 01:20:50 -07:00
if ( fatal )
goto error_return ;
2010-05-16 07:00:00 -04:00
fatal = - ESRCH ;
gdp = ext4_get_group_desc ( sb , block_group , & bh2 ) ;
if ( gdp ) {
2006-10-11 01:20:50 -07:00
BUFFER_TRACE ( bh2 , " get_write_access " ) ;
2006-10-11 01:20:53 -07:00
fatal = ext4_journal_get_write_access ( handle , bh2 ) ;
2010-05-16 07:00:00 -04:00
}
ext4_lock_group ( sb , block_group ) ;
ext4: use proper little-endian bitops
ext4_{set,clear}_bit() is defined as __test_and_{set,clear}_bit_le() for
ext4. Only two ext4_{set,clear}_bit() calls check the return value. The
rest of calls ignore the return value and they can be replaced with
__{set,clear}_bit_le().
This changes ext4_{set,clear}_bit() from __test_and_{set,clear}_bit_le()
to __{set,clear}_bit_le() and introduces ext4_test_and_{set,clear}_bit()
for the two places where old bit needs to be returned.
This ext4_{set,clear}_bit() change is considered safe, because if someone
uses these macros without noticing the change, the new ext4_{set,clear}_bit()
have no return value, so any use of the return value causes a compiler
error.
This also removes unused ext4_find_first_zero_bit().
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
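As a rough illustration of what a non-atomic little-endian test-and-clear does on a byte-addressed bitmap, consider the sketch below. The function name test_and_clear_bit_le() is chosen here for the example and is not the kernel macro; like the __-prefixed kernel variants it assumes the caller already holds a lock (the group lock in the code that follows).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative, non-atomic test-and-clear on a little-endian bitmap:
 * bit nr lives in byte nr / 8, at position nr % 8 within that byte,
 * independent of the host CPU's word size or endianness.
 * Returns the previous value of the bit (0 or 1).
 */
int test_and_clear_bit_le(unsigned int nr, uint8_t *bitmap)
{
	uint8_t mask = (uint8_t)(1u << (nr & 7));
	uint8_t *byte;
	int old;

	assert(bitmap);
	byte = bitmap + (nr >> 3);
	old = (*byte & mask) != 0;
	*byte &= (uint8_t)~mask;
	return old;
}
```

A return value of 0 means the bit was already clear, which is the case the caller below reports as "bit already cleared".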
2011-12-28 20:32:07 -05:00
cleared = ext4_test_and_clear_bit ( bit , bitmap_bh - > b_data ) ;
2010-05-16 07:00:00 -04:00
if ( fatal | | ! cleared ) {
ext4_unlock_group ( sb , block_group ) ;
goto out ;
}
2009-03-04 19:31:53 -05:00
2010-05-16 07:00:00 -04:00
count = ext4_free_inodes_count ( sb , gdp ) + 1 ;
ext4_free_inodes_set ( sb , gdp , count ) ;
if ( is_directory ) {
count = ext4_used_dirs_count ( sb , gdp ) - 1 ;
ext4_used_dirs_set ( sb , gdp , count ) ;
percpu_counter_dec ( & sbi - > s_dirs_counter ) ;
2006-10-11 01:20:50 -07:00
}
2012-04-29 18:33:10 -04:00
ext4_inode_bitmap_csum_set ( sb , block_group , gdp , bitmap_bh ,
EXT4_INODES_PER_GROUP ( sb ) / 8 ) ;
2012-04-29 18:45:10 -04:00
ext4_group_desc_csum_set ( sb , block_group , gdp ) ;
2010-05-16 07:00:00 -04:00
ext4_unlock_group ( sb , block_group ) ;
2006-10-11 01:20:50 -07:00
2010-05-16 07:00:00 -04:00
percpu_counter_inc ( & sbi - > s_freeinodes_counter ) ;
if ( sbi - > s_log_groups_per_flex ) {
ext4_group_t f = ext4_flex_group ( sbi , block_group ) ;
2009-03-04 19:09:10 -05:00
2010-05-16 07:00:00 -04:00
atomic_inc ( & sbi - > s_flex_groups [ f ] . free_inodes ) ;
if ( is_directory )
atomic_dec ( & sbi - > s_flex_groups [ f ] . used_dirs ) ;
2006-10-11 01:20:50 -07:00
}
2010-05-16 07:00:00 -04:00
BUFFER_TRACE ( bh2 , " call ext4_handle_dirty_metadata " ) ;
fatal = ext4_handle_dirty_metadata ( handle , NULL , bh2 ) ;
out :
if ( cleared ) {
BUFFER_TRACE ( bitmap_bh , " call ext4_handle_dirty_metadata " ) ;
err = ext4_handle_dirty_metadata ( handle , NULL , bitmap_bh ) ;
if ( ! fatal )
fatal = err ;
2013-08-28 18:32:58 -04:00
} else {
2010-05-16 07:00:00 -04:00
ext4_error ( sb , " bit already cleared for inode %lu " , ino ) ;
2014-07-12 16:11:42 -04:00
if ( gdp & & ! EXT4_MB_GRP_IBITMAP_CORRUPT ( grp ) ) {
2014-06-26 10:11:53 -04:00
int count ;
count = ext4_free_inodes_count ( sb , gdp ) ;
percpu_counter_sub ( & sbi - > s_freeinodes_counter ,
count ) ;
}
2013-08-28 18:32:58 -04:00
set_bit ( EXT4_GROUP_INFO_IBITMAP_CORRUPT_BIT , & grp - > bb_state ) ;
}
2010-05-16 07:00:00 -04:00
2006-10-11 01:20:50 -07:00
error_return :
brelse ( bitmap_bh ) ;
2006-10-11 01:20:53 -07:00
ext4_std_error ( sb , fatal ) ;
2006-10-11 01:20:50 -07:00
}
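The mapping from an inode number to its block group and bitmap bit used in ext4_free_inode() above is plain integer arithmetic on the 1-based inode number. A minimal standalone sketch follows; struct inode_location and locate_inode() are hypothetical names for this example, and inodes_per_group stands in for EXT4_INODES_PER_GROUP(sb).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type for this sketch. */
struct inode_location {
	uint32_t group;	/* block group holding the inode */
	uint32_t bit;	/* bit index inside that group's inode bitmap */
};

/*
 * Inode numbers start at 1, so subtract 1 before dividing:
 * group = (ino - 1) / inodes_per_group, bit = (ino - 1) % inodes_per_group.
 */
struct inode_location locate_inode(uint32_t ino, uint32_t inodes_per_group)
{
	struct inode_location loc;

	assert(ino >= 1 && inodes_per_group > 0);
	loc.group = (ino - 1) / inodes_per_group;
	loc.bit = (ino - 1) % inodes_per_group;
	return loc;
}
```

With 8192 inodes per group, inode 8192 is the last bit of group 0 and inode 8193 is bit 0 of group 1.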
2009-03-12 12:18:34 -04:00
struct orlov_stats {
2013-03-11 23:39:59 -04:00
__u64 free_clusters ;
2009-03-12 12:18:34 -04:00
__u32 free_inodes ;
__u32 used_dirs ;
} ;
/*
* Helper function for Orlov ' s allocator ; returns critical information
* for a particular block group or flex_bg . If flex_size is 1 , then g
* is a block group number ; otherwise it is flex_bg number .
*/
2010-10-27 21:30:14 -04:00
static void get_orlov_stats ( struct super_block * sb , ext4_group_t g ,
int flex_size , struct orlov_stats * stats )
2009-03-12 12:18:34 -04:00
{
struct ext4_group_desc * desc ;
2009-03-04 19:31:53 -05:00
struct flex_groups * flex_group = EXT4_SB ( sb ) - > s_flex_groups ;
2009-03-12 12:18:34 -04:00
2009-03-04 19:31:53 -05:00
if ( flex_size > 1 ) {
stats - > free_inodes = atomic_read ( & flex_group [ g ] . free_inodes ) ;
2013-03-11 23:39:59 -04:00
stats - > free_clusters = atomic64_read ( & flex_group [ g ] . free_clusters ) ;
2009-03-04 19:31:53 -05:00
stats - > used_dirs = atomic_read ( & flex_group [ g ] . used_dirs ) ;
return ;
}
2009-03-12 12:18:34 -04:00
2009-03-04 19:31:53 -05:00
desc = ext4_get_group_desc ( sb , g , NULL ) ;
if ( desc ) {
stats - > free_inodes = ext4_free_inodes_count ( sb , desc ) ;
2011-09-09 19:08:51 -04:00
stats - > free_clusters = ext4_free_group_clusters ( sb , desc ) ;
2009-03-04 19:31:53 -05:00
stats - > used_dirs = ext4_used_dirs_count ( sb , desc ) ;
} else {
stats - > free_inodes = 0 ;
2011-09-09 18:58:51 -04:00
stats - > free_clusters = 0 ;
2009-03-04 19:31:53 -05:00
stats - > used_dirs = 0 ;
2009-03-12 12:18:34 -04:00
}
}
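When flex_size is greater than 1, get_orlov_stats() is indexed by flex_bg number rather than block group number. The relationship between the two, as used by find_group_orlov() below, can be sketched as follows. The helper names are hypothetical; log_groups_per_flex corresponds to sbi->s_log_groups_per_flex.

```c
#include <assert.h>

/*
 * Number of flex groups needed to cover real_ngroups block groups,
 * rounding up so a partially filled last flex group is counted.
 */
unsigned int flex_group_count(unsigned int real_ngroups,
			      unsigned int log_groups_per_flex)
{
	unsigned int flex_size;

	assert(log_groups_per_flex < 32);
	flex_size = 1u << log_groups_per_flex;
	return (real_ngroups + flex_size - 1) >> log_groups_per_flex;
}

/* Flex group that contains a given block group. */
unsigned int flex_group_of(unsigned int group,
			   unsigned int log_groups_per_flex)
{
	assert(log_groups_per_flex < 32);
	return group >> log_groups_per_flex;
}
```

For example, 100 block groups with 16 groups per flex group give 7 flex groups, the last one holding only 4 block groups.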
2006-10-11 01:20:50 -07:00
/*
* Orlov ' s allocator for directories .
*
* We always try to spread first - level directories .
*
* If there are blockgroups with both free inodes and free blocks counts
* not worse than average we return one with smallest directory count .
* Otherwise we simply return a random group .
*
* For the rest the rules look like this :
*
* It ' s OK to put directory into a group unless
* it has too many directories already ( max_dirs ) or
* it has too few free inodes left ( min_inodes ) or
* it has too few free blocks left ( min_blocks ) .
2008-04-21 22:45:55 +00:00
* Parent ' s group is preferred , if it doesn ' t satisfy these
2006-10-11 01:20:50 -07:00
* conditions we search cyclically through the rest . If none
* of the groups look good we just look for a group with more
* free inodes than average ( starting at parent ' s group ) .
*/
2008-01-28 23:58:27 -05:00
static int find_group_orlov ( struct super_block * sb , struct inode * parent ,
2011-07-26 02:48:06 -04:00
ext4_group_t * group , umode_t mode ,
2009-06-13 11:09:42 -04:00
const struct qstr * qstr )
2006-10-11 01:20:50 -07:00
{
2008-01-28 23:58:27 -05:00
ext4_group_t parent_group = EXT4_I ( parent ) - > i_block_group ;
2006-10-11 01:20:53 -07:00
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
2009-05-01 08:50:38 -04:00
ext4_group_t real_ngroups = ext4_get_groups_count ( sb ) ;
2006-10-11 01:20:53 -07:00
int inodes_per_group = EXT4_INODES_PER_GROUP ( sb ) ;
2011-12-28 20:25:13 -05:00
unsigned int freei , avefreei , grp_free ;
2011-09-09 18:58:51 -04:00
ext4_fsblk_t freeb , avefreec ;
2006-10-11 01:20:50 -07:00
unsigned int ndirs ;
2009-03-12 12:18:34 -04:00
int max_dirs , min_inodes ;
2011-09-09 18:58:51 -04:00
ext4_grpblk_t min_clusters ;
2009-05-01 08:50:38 -04:00
ext4_group_t i , grp , g , ngroups ;
2006-10-11 01:20:53 -07:00
struct ext4_group_desc * desc ;
2009-03-12 12:18:34 -04:00
struct orlov_stats stats ;
int flex_size = ext4_flex_bg_size ( sbi ) ;
2009-06-13 11:09:42 -04:00
struct dx_hash_info hinfo ;
2009-03-12 12:18:34 -04:00
2009-05-01 08:50:38 -04:00
ngroups = real_ngroups ;
2009-03-12 12:18:34 -04:00
if ( flex_size > 1 ) {
2009-05-01 08:50:38 -04:00
ngroups = ( real_ngroups + flex_size - 1 ) > >
2009-03-12 12:18:34 -04:00
sbi - > s_log_groups_per_flex ;
parent_group > > = sbi - > s_log_groups_per_flex ;
}
2006-10-11 01:20:50 -07:00
freei = percpu_counter_read_positive ( & sbi - > s_freeinodes_counter ) ;
avefreei = freei / ngroups ;
2011-09-09 18:56:51 -04:00
freeb = EXT4_C2B ( sbi ,
percpu_counter_read_positive ( & sbi - > s_freeclusters_counter ) ) ;
2011-09-09 18:58:51 -04:00
avefreec = freeb ;
do_div ( avefreec , ngroups ) ;
2006-10-11 01:20:50 -07:00
ndirs = percpu_counter_read_positive ( & sbi - > s_dirs_counter ) ;
2009-03-12 12:18:34 -04:00
if ( S_ISDIR ( mode ) & &
2015-03-17 22:25:59 +00:00
( ( parent = = d_inode ( sb - > s_root ) ) | |
2010-05-16 22:00:00 -04:00
( ext4_test_inode_flag ( parent , EXT4_INODE_TOPDIR ) ) ) ) {
2006-10-11 01:20:50 -07:00
int best_ndir = inodes_per_group ;
2008-01-28 23:58:27 -05:00
int ret = - 1 ;
2006-10-11 01:20:50 -07:00
2009-06-13 11:09:42 -04:00
if ( qstr ) {
hinfo . hash_version = DX_HASH_HALF_MD4 ;
hinfo . seed = sbi - > s_hash_seed ;
ext4fs_dirhash ( qstr - > name , qstr - > len , & hinfo ) ;
grp = hinfo . hash ;
} else
2013-11-08 00:14:53 -05:00
grp = prandom_u32 ( ) ;
2008-01-28 23:58:27 -05:00
parent_group = ( unsigned ) grp % ngroups ;
2006-10-11 01:20:50 -07:00
for ( i = 0 ; i < ngroups ; i + + ) {
2009-03-12 12:18:34 -04:00
g = ( parent_group + i ) % ngroups ;
get_orlov_stats ( sb , g , flex_size , & stats ) ;
if ( ! stats . free_inodes )
2006-10-11 01:20:50 -07:00
continue ;
2009-03-12 12:18:34 -04:00
if ( stats . used_dirs > = best_ndir )
2006-10-11 01:20:50 -07:00
continue ;
2009-03-12 12:18:34 -04:00
if ( stats . free_inodes < avefreei )
2006-10-11 01:20:50 -07:00
continue ;
2011-09-09 18:58:51 -04:00
if ( stats . free_clusters < avefreec )
2006-10-11 01:20:50 -07:00
continue ;
2009-03-12 12:18:34 -04:00
grp = g ;
2008-01-28 23:58:27 -05:00
ret = 0 ;
2009-03-12 12:18:34 -04:00
best_ndir = stats . used_dirs ;
}
if ( ret )
goto fallback ;
found_flex_bg :
if ( flex_size = = 1 ) {
* group = grp ;
return 0 ;
}
/*
* We pack inodes at the beginning of the flexgroup ' s
* inode tables . Block allocation decisions will do
* something similar , although regular files will
* start at 2 nd block group of the flexgroup . See
* ext4_ext_find_goal ( ) and ext4_find_near ( ) .
*/
grp * = flex_size ;
for ( i = 0 ; i < flex_size ; i + + ) {
2009-05-01 08:50:38 -04:00
if ( grp + i > = real_ngroups )
2009-03-12 12:18:34 -04:00
break ;
desc = ext4_get_group_desc ( sb , grp + i , NULL ) ;
if ( desc & & ext4_free_inodes_count ( sb , desc ) ) {
* group = grp + i ;
return 0 ;
}
2006-10-11 01:20:50 -07:00
}
goto fallback ;
}
max_dirs = ndirs / ngroups + inodes_per_group / 16 ;
2009-03-12 12:18:34 -04:00
min_inodes = avefreei - inodes_per_group * flex_size / 4 ;
if ( min_inodes < 1 )
min_inodes = 1 ;
2011-09-09 18:58:51 -04:00
min_clusters = avefreec - EXT4_CLUSTERS_PER_GROUP ( sb ) * flex_size / 4 ;
2009-03-12 12:18:34 -04:00
/*
* Start looking in the flex group where we last allocated an
* inode for this parent directory
*/
if ( EXT4_I ( parent ) - > i_last_alloc_group ! = ~ 0 ) {
parent_group = EXT4_I ( parent ) - > i_last_alloc_group ;
if ( flex_size > 1 )
parent_group > > = sbi - > s_log_groups_per_flex ;
}
2006-10-11 01:20:50 -07:00
for ( i = 0 ; i < ngroups ; i + + ) {
2009-03-12 12:18:34 -04:00
grp = ( parent_group + i ) % ngroups ;
get_orlov_stats ( sb , grp , flex_size , & stats ) ;
if ( stats . used_dirs > = max_dirs )
2006-10-11 01:20:50 -07:00
continue ;
2009-03-12 12:18:34 -04:00
if ( stats . free_inodes < min_inodes )
2006-10-11 01:20:50 -07:00
continue ;
2011-09-09 18:58:51 -04:00
if ( stats . free_clusters < min_clusters )
2006-10-11 01:20:50 -07:00
continue ;
2009-03-12 12:18:34 -04:00
goto found_flex_bg ;
2006-10-11 01:20:50 -07:00
}
fallback :
2009-05-01 08:50:38 -04:00
ngroups = real_ngroups ;
2009-03-12 12:18:34 -04:00
avefreei = freei / ngroups ;
2009-04-22 21:00:36 -04:00
fallback_retry :
2009-03-12 12:18:34 -04:00
parent_group = EXT4_I ( parent ) - > i_block_group ;
2006-10-11 01:20:50 -07:00
for ( i = 0 ; i < ngroups ; i + + ) {
2009-03-12 12:18:34 -04:00
grp = ( parent_group + i ) % ngroups ;
desc = ext4_get_group_desc ( sb , grp , NULL ) ;
2012-05-28 14:16:57 -04:00
if ( desc ) {
grp_free = ext4_free_inodes_count ( sb , desc ) ;
if ( grp_free & & grp_free > = avefreei ) {
* group = grp ;
return 0 ;
}
2009-03-12 12:18:34 -04:00
}
2006-10-11 01:20:50 -07:00
}
if ( avefreei ) {
/*
* The free - inodes counter is approximate , and for really small
* filesystems the above test can fail to find any blockgroups
*/
avefreei = 0 ;
2009-04-22 21:00:36 -04:00
goto fallback_retry ;
2006-10-11 01:20:50 -07:00
}
return - 1 ;
}
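The thresholds that find_group_orlov() applies to non-top-level directories can be worked through in isolation. The sketch below mirrors the three assignments to max_dirs, min_inodes and min_clusters above; struct orlov_limits and orlov_compute_limits() are hypothetical names, and a 64-bit signed type is used for min_clusters here purely to keep the example free of overflow, which is an assumption of this sketch rather than the kernel's choice of type.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the three acceptance thresholds. */
struct orlov_limits {
	int max_dirs;		/* reject groups with this many dirs or more */
	int min_inodes;		/* reject groups with fewer free inodes */
	int64_t min_clusters;	/* reject groups with fewer free clusters */
};

struct orlov_limits orlov_compute_limits(unsigned int ndirs,
					 unsigned int ngroups,
					 int inodes_per_group,
					 int clusters_per_group,
					 int flex_size,
					 unsigned int avefreei,
					 uint64_t avefreec)
{
	struct orlov_limits lim;

	assert(ngroups > 0);
	/* Average directory count plus a slack of 1/16 of a group. */
	lim.max_dirs = (int)(ndirs / ngroups) + inodes_per_group / 16;
	/* Average free inodes minus a quarter of a (flex) group, at least 1. */
	lim.min_inodes = (int)avefreei - inodes_per_group * flex_size / 4;
	if (lim.min_inodes < 1)
		lim.min_inodes = 1;
	/* Average free clusters minus a quarter of a (flex) group. */
	lim.min_clusters = (int64_t)avefreec -
			   (int64_t)clusters_per_group * flex_size / 4;
	return lim;
}
```

With 1000 directories over 10 groups, 8192 inodes and 32768 clusters per group, flex_size 1, and averages of 4000 free inodes and 20000 free clusters, the limits come out as max_dirs = 612, min_inodes = 1952 and min_clusters = 11808.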
2008-01-28 23:58:27 -05:00
static int find_group_other ( struct super_block * sb , struct inode * parent ,
2011-07-26 02:48:06 -04:00
ext4_group_t * group , umode_t mode )
2006-10-11 01:20:50 -07:00
{
2008-01-28 23:58:27 -05:00
ext4_group_t parent_group = EXT4_I ( parent ) - > i_block_group ;
2009-05-01 08:50:38 -04:00
ext4_group_t i , last , ngroups = ext4_get_groups_count ( sb ) ;
2006-10-11 01:20:53 -07:00
struct ext4_group_desc * desc ;
2009-03-12 12:18:34 -04:00
int flex_size = ext4_flex_bg_size ( EXT4_SB ( sb ) ) ;
/*
* Try to place the inode in the same flex group as its
* parent . If we can ' t find space , use the Orlov algorithm to
* find another flex group , and store that information in the
* parent directory ' s inode information so that we use that flex
* group for future allocations .
*/
if ( flex_size > 1 ) {
int retry = 0 ;
try_again :
parent_group & = ~ ( flex_size - 1 ) ;
last = parent_group + flex_size ;
if ( last > ngroups )
last = ngroups ;
for ( i = parent_group ; i < last ; i + + ) {
desc = ext4_get_group_desc ( sb , i , NULL ) ;
if ( desc & & ext4_free_inodes_count ( sb , desc ) ) {
* group = i ;
return 0 ;
}
}
if ( ! retry & & EXT4_I ( parent ) - > i_last_alloc_group ! = ~ 0 ) {
retry = 1 ;
parent_group = EXT4_I ( parent ) - > i_last_alloc_group ;
goto try_again ;
}
/*
* If this didn ' t work , use the Orlov search algorithm
* to find a new flex group ; we pass in the mode to
* avoid the topdir algorithms .
*/
* group = parent_group + flex_size ;
if ( * group > ngroups )
* group = 0 ;
2011-02-21 21:01:42 -05:00
return find_group_orlov ( sb , parent , group , mode , NULL ) ;
2009-03-12 12:18:34 -04:00
}
2006-10-11 01:20:50 -07:00
/*
* Try to place the inode in its parent directory ' s block group
*/
2008-01-28 23:58:27 -05:00
* group = parent_group ;
desc = ext4_get_group_desc ( sb , * group , NULL ) ;
2009-01-05 22:20:24 -05:00
if ( desc & & ext4_free_inodes_count ( sb , desc ) & &
2011-09-09 19:08:51 -04:00
ext4_free_group_clusters ( sb , desc ) )
2008-01-28 23:58:27 -05:00
return 0 ;
2006-10-11 01:20:50 -07:00
/*
* We ' re going to place this inode in a different blockgroup from its
* parent . We want to cause files in a common directory to all land in
* the same blockgroup . But we want files which are in a different
* directory which shares a blockgroup with our parent to land in a
* different blockgroup .
*
* So add our directory ' s i_ino into the starting point for the hash .
*/
2008-01-28 23:58:27 -05:00
* group = ( * group + parent - > i_ino ) % ngroups ;
2006-10-11 01:20:50 -07:00
/*
* Use a hash probe whose step doubles on each iteration ( historically
* called a quadratic hash ) to find a group with a free inode and some
* free blocks .
*/
for ( i = 1 ; i < ngroups ; i < < = 1 ) {
2008-01-28 23:58:27 -05:00
* group + = i ;
if ( * group > = ngroups )
* group - = ngroups ;
desc = ext4_get_group_desc ( sb , * group , NULL ) ;
2009-01-05 22:20:24 -05:00
if ( desc & & ext4_free_inodes_count ( sb , desc ) & &
2011-09-09 19:08:51 -04:00
ext4_free_group_clusters ( sb , desc ) )
2008-01-28 23:58:27 -05:00
return 0 ;
2006-10-11 01:20:50 -07:00
}
/*
* That failed : try linear search for a free inode , even if that group
* has no free blocks .
*/
2008-01-28 23:58:27 -05:00
* group = parent_group ;
2006-10-11 01:20:50 -07:00
for ( i = 0 ; i < ngroups ; i + + ) {
2008-01-28 23:58:27 -05:00
if ( + + * group > = ngroups )
* group = 0 ;
desc = ext4_get_group_desc ( sb , * group , NULL ) ;
2009-01-05 22:20:24 -05:00
if ( desc & & ext4_free_inodes_count ( sb , desc ) )
2008-01-28 23:58:27 -05:00
return 0 ;
2006-10-11 01:20:50 -07:00
}
return - 1 ;
}
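The probe order used by the non-flex path of find_group_other above can be sketched in isolation. This is a minimal userspace illustration, not kernel code: `probe_sequence` and its parameters are hypothetical names, and the free-inode and free-cluster checks are left out so that only the visiting order is shown. It assumes `ngroups` is greater than zero.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of ext4): record the block groups that
 * the doubling-step probe would visit, in order.
 *
 * The starting point mixes the parent's group with the parent's inode
 * number, so files from different directories that share a block group
 * tend to spread over different groups.
 *
 * Returns the number of entries written to out[].
 */
size_t probe_sequence(unsigned parent_group, unsigned long parent_ino,
                      unsigned ngroups, unsigned *out, size_t max)
{
    size_t n = 0;
    unsigned group = (unsigned)((parent_group + parent_ino) % ngroups);

    /* The step doubles each round: 1, 2, 4, ... while step < ngroups. */
    for (unsigned step = 1; step < ngroups && n < max; step <<= 1) {
        group += step;
        if (group >= ngroups)
            group -= ngroups;   /* wrap around the end of the group table */
        out[n++] = group;
    }
    return n;
}
```

With 8 groups, a parent in group 2 and parent inode number 13, the start is (2 + 13) % 8 = 7 and the groups probed are 0, 2 and 6. Only about log2(ngroups) groups are examined, which is why the function falls back to a linear scan afterwards.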
ext4: avoid reusing recently deleted inodes in no journal mode
In no journal mode, if an inode has recently been deleted, we
shouldn't reuse it right away. Otherwise it's possible, after an
unclean shutdown, to hit a situation where a recently deleted inode
gets reused for some other purpose before the inode table block has
been written to disk. However, if the directory entry has been
updated, then the directory entry will be pointing at the old inode
contents.
E2fsck will make sure the file system is consistent after the
unclean shutdown. However, if the recently deleted inode is a
character mode device, or an inode with the immutable bit set, even
after the file system has been fixed up by e2fsck, it can be
possible for a *.pyc file to be pointing at a character mode
device, and when python tries to open the *.pyc file, Hilarity
Ensues. We could change all of userspace to be very suspicious
about stat'ing files before opening them, and clearing the
immutable flag if necessary --- or we can just avoid reusing an
inode number if it has been recently deleted.
Google-Bug-Id: 10017573
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-16 22:06:55 -04:00
/*
* In no journal mode , if an inode has recently been deleted , we want
* to avoid reusing it until we ' re reasonably sure the inode table
* block has been written back to disk . ( Yes , these values are
* somewhat arbitrary . . . )
*/
# define RECENTCY_MIN 5
# define RECENTCY_DIRTY 30
static int recently_deleted ( struct super_block * sb , ext4_group_t group , int ino )
{
struct ext4_group_desc * gdp ;
struct ext4_inode * raw_inode ;
struct buffer_head * bh ;
unsigned long dtime , now ;
int inodes_per_block = EXT4_SB ( sb ) - > s_inodes_per_block ;
int offset , ret = 0 , recentcy = RECENTCY_MIN ;
gdp = ext4_get_group_desc ( sb , group , NULL ) ;
if ( unlikely ( ! gdp ) )
return 0 ;
bh = sb_getblk ( sb , ext4_inode_table ( sb , gdp ) +
( ino / inodes_per_block ) ) ;
if ( unlikely ( ! bh ) | | ! buffer_uptodate ( bh ) )
/*
* If the block is not in the buffer cache , then it
* must have been written out .
*/
goto out ;
offset = ( ino % inodes_per_block ) * EXT4_INODE_SIZE ( sb ) ;
raw_inode = ( struct ext4_inode * ) ( bh - > b_data + offset ) ;
dtime = le32_to_cpu ( raw_inode - > i_dtime ) ;
now = get_seconds ( ) ;
if ( buffer_dirty ( bh ) )
recentcy + = RECENTCY_DIRTY ;
if ( dtime & & ( dtime < now ) & & ( now < dtime + recentcy ) )
ret = 1 ;
out :
brelse ( bh ) ;
return ret ;
}
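The time-window test at the end of recently_deleted can be tried out on its own. The following is a minimal userspace sketch, not kernel code: `within_recency_window` is a hypothetical name, and the buffer-cache lookup is replaced by a plain boolean saying whether the inode table block is still dirty. The two constants are copied from the definitions above.

```c
#include <stdbool.h>

#define RECENTCY_MIN    5   /* seconds, as defined above */
#define RECENTCY_DIRTY  30  /* extra seconds while the block is dirty */

/*
 * Hypothetical helper (not part of ext4): returns true when an inode
 * deleted at time dtime should still be treated as recently deleted at
 * time now. A dtime of zero means the inode was never deleted.
 */
bool within_recency_window(unsigned long dtime, unsigned long now,
                           bool buffer_is_dirty)
{
    unsigned long recentcy = RECENTCY_MIN;

    if (buffer_is_dirty)
        recentcy += RECENTCY_DIRTY;

    return dtime && dtime < now && now < dtime + recentcy;
}
```

Both comparisons are strict, mirroring the original condition: an inode whose dtime equals the current second is not reported as recent, and the window closes exactly at dtime + recentcy.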
2006-10-11 01:20:50 -07:00
/*
* There are two policies for allocating an inode . If the new inode is
* a directory , then a forward search is made for a block group with both
* free space and a low directory - to - inode ratio ; if that fails , then of
* the groups with above - average free space , that group with the fewest
* directories already is chosen .
*
* For other inodes , search forward from the parent directory ' s block
* group to find a free inode .
*/
2013-02-09 16:27:09 -05:00
struct inode * __ext4_new_inode ( handle_t * handle , struct inode * dir ,
umode_t mode , const struct qstr * qstr ,
2017-06-21 21:21:39 -04:00
__u32 goal , uid_t * owner , __u32 i_flags ,
int handle_type , unsigned int line_no ,
int nblocks )
2006-10-11 01:20:50 -07:00
{
struct super_block * sb ;
2009-01-03 22:33:39 -05:00
struct buffer_head * inode_bitmap_bh = NULL ;
struct buffer_head * group_desc_bh ;
2009-05-01 08:50:38 -04:00
ext4_group_t ngroups , group = 0 ;
2006-10-11 01:20:50 -07:00
unsigned long ino = 0 ;
2008-09-08 22:25:24 -04:00
struct inode * inode ;
struct ext4_group_desc * gdp = NULL ;
2006-10-11 01:20:53 -07:00
struct ext4_inode_info * ei ;
struct ext4_sb_info * sbi ;
2015-06-29 16:22:54 +02:00
int ret2 , err ;
2006-10-11 01:20:50 -07:00
struct inode * ret ;
2008-01-28 23:58:27 -05:00
ext4_group_t i ;
2008-07-11 19:27:31 -04:00
ext4_group_t flex_group ;
2013-08-28 18:32:58 -04:00
struct ext4_group_info * grp ;
2015-05-31 13:35:02 -04:00
int encrypt = 0 ;
2006-10-11 01:20:50 -07:00
/* Cannot create files in a deleted directory */
if ( ! dir | | ! dir - > i_nlink )
return ERR_PTR ( - EPERM ) ;
2017-07-06 00:01:59 -04:00
sb = dir - > i_sb ;
sbi = EXT4_SB ( sb ) ;
if ( unlikely ( ext4_forced_shutdown ( sbi ) ) )
2017-02-05 01:28:48 -05:00
return ERR_PTR ( - EIO ) ;
2017-07-06 00:01:59 -04:00
if ( ( ext4_encrypted_inode ( dir ) | | DUMMY_ENCRYPTION_ENABLED ( sbi ) ) & &
2017-07-06 00:00:59 -04:00
( S_ISREG ( mode ) | | S_ISDIR ( mode ) | | S_ISLNK ( mode ) ) & &
! ( i_flags & EXT4_EA_INODE_FL ) ) {
2016-07-10 14:01:03 -04:00
err = fscrypt_get_encryption_info ( dir ) ;
2015-05-31 13:35:02 -04:00
if ( err )
return ERR_PTR ( err ) ;
2016-07-10 14:01:03 -04:00
if ( ! fscrypt_has_encryption_key ( dir ) )
2016-12-05 11:12:44 -08:00
return ERR_PTR ( - ENOKEY ) ;
2015-05-31 13:35:02 -04:00
encrypt = 1 ;
}
2017-07-06 00:01:59 -04:00
if ( ! handle & & sbi - > s_journal & & ! ( i_flags & EXT4_EA_INODE_FL ) ) {
# ifdef CONFIG_EXT4_FS_POSIX_ACL
struct posix_acl * p = get_acl ( dir , ACL_TYPE_DEFAULT ) ;
if ( p ) {
int acl_size = p - > a_count * sizeof ( ext4_acl_entry ) ;
nblocks + = ( S_ISDIR ( mode ) ? 2 : 1 ) *
__ext4_xattr_set_credits ( sb , NULL /* inode */ ,
NULL /* block_bh */ , acl_size ,
true /* is_create */ ) ;
posix_acl_release ( p ) ;
}
# endif
# ifdef CONFIG_SECURITY
{
int num_security_xattrs = 1 ;
# ifdef CONFIG_INTEGRITY
num_security_xattrs + + ;
# endif
/*
* We assume that security xattrs are never
* more than 1 k . In practice they are under
* 128 bytes .
*/
nblocks + = num_security_xattrs *
__ext4_xattr_set_credits ( sb , NULL /* inode */ ,
NULL /* block_bh */ , 1024 ,
true /* is_create */ ) ;
}
# endif
if ( encrypt )
nblocks + = __ext4_xattr_set_credits ( sb ,
NULL /* inode */ , NULL /* block_bh */ ,
FSCRYPT_SET_CONTEXT_MAX_SIZE ,
true /* is_create */ ) ;
}
2009-05-01 08:50:38 -04:00
ngroups = ext4_get_groups_count ( sb ) ;
2009-06-17 11:48:11 -04:00
trace_ext4_request_inode ( dir , mode ) ;
2006-10-11 01:20:50 -07:00
inode = new_inode ( sb ) ;
if ( ! inode )
return ERR_PTR ( - ENOMEM ) ;
2006-10-11 01:20:53 -07:00
ei = EXT4_I ( inode ) ;
2008-07-11 19:27:31 -04:00
2013-04-19 13:38:14 -04:00
/*
2016-03-09 23:49:05 -05:00
* Initialize owners and quota early so that we don ' t have to account
2013-04-19 13:38:14 -04:00
* for quota initialization worst case in standard inode creating
* transaction
*/
if ( owner ) {
inode - > i_mode = mode ;
i_uid_write ( inode , owner [ 0 ] ) ;
i_gid_write ( inode , owner [ 1 ] ) ;
} else if ( test_opt ( sb , GRPID ) ) {
inode - > i_mode = mode ;
inode - > i_uid = current_fsuid ( ) ;
inode - > i_gid = dir - > i_gid ;
} else
inode_init_owner ( inode , dir , mode ) ;
2016-01-08 16:01:21 -05:00
2016-09-05 23:11:58 -04:00
if ( ext4_has_feature_project ( sb ) & &
2016-01-08 16:01:21 -05:00
ext4_test_inode_flag ( dir , EXT4_INODE_PROJINHERIT ) )
ei - > i_projid = EXT4_I ( dir ) - > i_projid ;
else
ei - > i_projid = make_kprojid ( & init_user_ns , EXT4_DEF_PROJID ) ;
2015-06-29 16:22:54 +02:00
err = dquot_initialize ( inode ) ;
if ( err )
goto out ;
2013-04-19 13:38:14 -04:00
2009-06-13 11:45:35 -04:00
if ( ! goal )
goal = sbi - > s_inode_goal ;
2009-07-05 23:45:11 -04:00
if ( goal & & goal < = le32_to_cpu ( sbi - > s_es - > s_inodes_count ) ) {
2009-06-13 11:45:35 -04:00
group = ( goal - 1 ) / EXT4_INODES_PER_GROUP ( sb ) ;
ino = ( goal - 1 ) % EXT4_INODES_PER_GROUP ( sb ) ;
ret2 = 0 ;
goto got_group ;
}
2011-10-08 14:34:47 -04:00
if ( S_ISDIR ( mode ) )
ret2 = find_group_orlov ( sb , dir , & group , mode , qstr ) ;
else
2009-03-12 12:18:34 -04:00
ret2 = find_group_other ( sb , dir , & group , mode ) ;
2006-10-11 01:20:50 -07:00
2008-07-11 19:27:31 -04:00
got_group :
2009-03-12 12:18:34 -04:00
EXT4_I ( dir ) - > i_last_alloc_group = group ;
2006-10-11 01:20:50 -07:00
err = - ENOSPC ;
2008-01-28 23:58:27 -05:00
if ( ret2 = = - 1 )
2006-10-11 01:20:50 -07:00
goto out ;
2012-02-06 20:12:03 -05:00
/*
* Normally we will only go through one pass of this loop ,
* unless we get unlucky and it turns out the group we selected
* had its last inode grabbed by someone else .
*/
2009-06-13 11:45:35 -04:00
for ( i = 0 ; i < ngroups ; i + + , ino = 0 ) {
2006-10-11 01:20:50 -07:00
err = - EIO ;
2009-01-03 22:33:39 -05:00
gdp = ext4_get_group_desc ( sb , group , & group_desc_bh ) ;
2006-10-11 01:20:50 -07:00
if ( ! gdp )
2013-04-19 13:38:14 -04:00
goto out ;
2006-10-11 01:20:50 -07:00
2012-09-23 23:16:03 -04:00
/*
* Check free inodes count before loading bitmap .
*/
if ( ext4_free_inodes_count ( sb , gdp ) = = 0 ) {
if ( + + group = = ngroups )
group = 0 ;
continue ;
}
2013-08-28 18:32:58 -04:00
grp = ext4_get_group_info ( sb , group ) ;
/* Skip groups with already-known suspicious inode tables */
if ( EXT4_MB_GRP_IBITMAP_CORRUPT ( grp ) ) {
if ( + + group = = ngroups )
group = 0 ;
continue ;
}
2009-01-03 22:33:39 -05:00
brelse ( inode_bitmap_bh ) ;
inode_bitmap_bh = ext4_read_inode_bitmap ( sb , group ) ;
2013-08-28 18:32:58 -04:00
/* Skip groups with suspicious inode tables */
2015-10-17 21:33:24 -04:00
if ( EXT4_MB_GRP_IBITMAP_CORRUPT ( grp ) | |
IS_ERR ( inode_bitmap_bh ) ) {
inode_bitmap_bh = NULL ;
2013-08-28 18:32:58 -04:00
if ( + + group = = ngroups )
group = 0 ;
continue ;
}
2006-10-11 01:20:50 -07:00
repeat_in_this_group :
2006-10-11 01:20:53 -07:00
ino = ext4_find_next_zero_bit ( ( unsigned long * )
2009-01-03 22:33:39 -05:00
inode_bitmap_bh - > b_data ,
EXT4_INODES_PER_GROUP ( sb ) , ino ) ;
2013-07-26 15:15:46 -04:00
if ( ino > = EXT4_INODES_PER_GROUP ( sb ) )
goto next_group ;
2012-02-06 20:12:03 -05:00
if ( group = = 0 & & ( ino + 1 ) < EXT4_FIRST_INO ( sb ) ) {
ext4_error ( sb , " reserved inode found cleared - "
" inode=%lu " , ino + 1 ) ;
continue ;
}
2013-08-16 22:06:55 -04:00
if ( ( EXT4_SB ( sb ) - > s_journal = = NULL ) & &
recently_deleted ( sb , group , ino ) ) {
ino + + ;
goto next_inode ;
}
2013-02-09 16:27:09 -05:00
if ( ! handle ) {
BUG_ON ( nblocks < = 0 ) ;
handle = __ext4_journal_start_sb ( dir - > i_sb , line_no ,
2013-06-04 12:37:50 -04:00
handle_type , nblocks ,
0 ) ;
2013-02-09 16:27:09 -05:00
if ( IS_ERR ( handle ) ) {
err = PTR_ERR ( handle ) ;
2013-04-19 13:38:14 -04:00
ext4_std_error ( sb , err ) ;
goto out ;
2013-02-09 16:27:09 -05:00
}
}
2012-10-28 22:24:57 -04:00
BUFFER_TRACE ( inode_bitmap_bh , " get_write_access " ) ;
err = ext4_journal_get_write_access ( handle , inode_bitmap_bh ) ;
2013-04-19 13:38:14 -04:00
if ( err ) {
ext4_std_error ( sb , err ) ;
goto out ;
}
2012-02-06 20:12:03 -05:00
ext4_lock_group ( sb , group ) ;
ret2 = ext4_test_and_set_bit ( ino , inode_bitmap_bh - > b_data ) ;
ext4_unlock_group ( sb , group ) ;
ino + + ; /* the inode bitmap is zero-based */
if ( ! ret2 )
goto got ; /* we grabbed the inode! */
2013-08-16 22:06:55 -04:00
next_inode :
2012-02-06 20:12:03 -05:00
if ( ino < EXT4_INODES_PER_GROUP ( sb ) )
goto repeat_in_this_group ;
2013-07-26 15:15:46 -04:00
next_group :
if ( + + group = = ngroups )
group = 0 ;
2006-10-11 01:20:50 -07:00
}
err = - ENOSPC ;
goto out ;
got :
2012-10-28 22:24:57 -04:00
BUFFER_TRACE ( inode_bitmap_bh , " call ext4_handle_dirty_metadata " ) ;
err = ext4_handle_dirty_metadata ( handle , NULL , inode_bitmap_bh ) ;
2013-04-19 13:38:14 -04:00
if ( err ) {
ext4_std_error ( sb , err ) ;
goto out ;
}
2012-10-28 22:24:57 -04:00
2014-07-05 16:28:35 -04:00
BUFFER_TRACE ( group_desc_bh , " get_write_access " ) ;
err = ext4_journal_get_write_access ( handle , group_desc_bh ) ;
if ( err ) {
ext4_std_error ( sb , err ) ;
goto out ;
}
2007-10-16 18:38:25 -04:00
/* We may have to initialize the block bitmap if it isn't already */
2012-04-29 18:45:10 -04:00
if ( ext4_has_group_desc_csum ( sb ) & &
2007-10-16 18:38:25 -04:00
gdp - > bg_flags & cpu_to_le16 ( EXT4_BG_BLOCK_UNINIT ) ) {
2009-01-03 22:33:39 -05:00
struct buffer_head * block_bitmap_bh ;
2007-10-16 18:38:25 -04:00
2009-01-03 22:33:39 -05:00
block_bitmap_bh = ext4_read_block_bitmap ( sb , group ) ;
2015-10-17 21:33:24 -04:00
if ( IS_ERR ( block_bitmap_bh ) ) {
err = PTR_ERR ( block_bitmap_bh ) ;
2014-10-30 10:53:16 -04:00
goto out ;
}
2009-01-03 22:33:39 -05:00
BUFFER_TRACE ( block_bitmap_bh , " get block bitmap access " ) ;
err = ext4_journal_get_write_access ( handle , block_bitmap_bh ) ;
2007-10-16 18:38:25 -04:00
if ( err ) {
2009-01-03 22:33:39 -05:00
brelse ( block_bitmap_bh ) ;
2013-04-19 13:38:14 -04:00
ext4_std_error ( sb , err ) ;
goto out ;
2007-10-16 18:38:25 -04:00
}
2011-09-09 18:42:51 -04:00
BUFFER_TRACE ( block_bitmap_bh , " dirty block bitmap " ) ;
err = ext4_handle_dirty_metadata ( handle , NULL , block_bitmap_bh ) ;
2007-10-16 18:38:25 -04:00
/* recheck and clear flag under lock if we still need to */
2011-09-09 18:42:51 -04:00
ext4_lock_group ( sb , group ) ;
2007-10-16 18:38:25 -04:00
if ( gdp - > bg_flags & cpu_to_le16 ( EXT4_BG_BLOCK_UNINIT ) ) {
2009-01-03 22:33:39 -05:00
gdp - > bg_flags & = cpu_to_le16 ( ~ EXT4_BG_BLOCK_UNINIT ) ;
2011-09-09 19:08:51 -04:00
ext4_free_group_clusters_set ( sb , gdp ,
2011-09-09 19:12:51 -04:00
ext4_free_clusters_after_init ( sb , group , gdp ) ) ;
2012-04-29 18:35:10 -04:00
ext4_block_bitmap_csum_set ( sb , group , gdp ,
2012-10-22 00:34:32 -04:00
block_bitmap_bh ) ;
2012-04-29 18:45:10 -04:00
ext4_group_desc_csum_set ( sb , group , gdp ) ;
2007-10-16 18:38:25 -04:00
}
2009-05-02 20:35:09 -04:00
ext4_unlock_group ( sb , group ) ;
2012-11-29 21:21:22 -05:00
brelse ( block_bitmap_bh ) ;
2007-10-16 18:38:25 -04:00
2013-04-19 13:38:14 -04:00
if ( err ) {
ext4_std_error ( sb , err ) ;
goto out ;
}
2007-10-16 18:38:25 -04:00
}
2012-02-06 20:12:03 -05:00
/* Update the relevant bg descriptor fields */
2012-04-29 18:33:10 -04:00
if ( ext4_has_group_desc_csum ( sb ) ) {
2012-02-06 20:12:03 -05:00
int free ;
struct ext4_group_info * grp = ext4_get_group_info ( sb , group ) ;
down_read ( & grp - > alloc_sem ) ; /* protect vs itable lazyinit */
ext4_lock_group ( sb , group ) ; /* while we modify the bg desc */
free = EXT4_INODES_PER_GROUP ( sb ) -
ext4_itable_unused_count ( sb , gdp ) ;
if ( gdp - > bg_flags & cpu_to_le16 ( EXT4_BG_INODE_UNINIT ) ) {
gdp - > bg_flags & = cpu_to_le16 ( ~ EXT4_BG_INODE_UNINIT ) ;
free = 0 ;
}
/*
* Check the relative inode number against the last used
* relative inode number in this group . If it is greater ,
* we need to update the bg_itable_unused count .
*/
if ( ino > free )
ext4_itable_unused_set ( sb , gdp ,
( EXT4_INODES_PER_GROUP ( sb ) - ino ) ) ;
up_read ( & grp - > alloc_sem ) ;
ext4: protect group inode free counting with group lock
Currently, when we set the group's free inode count, we do not hold the
group lock, so multiple threads may decrement the free inode count at
the same time. e2fsck then complains with something like:
Free inodes count wrong for group #1 (1, counted=0).
Fix? no
Free inodes count wrong for group #2 (3, counted=0).
Fix? no
Directories count wrong for group #2 (780, counted=779).
Fix? no
Free inodes count wrong for group #3 (2272, counted=2273).
Fix? no
This patch protects the update with ext4_lock_group.
The problem was found by xfstests test case 269 on a volume
created by mkfs with the parameter
"-O ^resize_inode,^uninit_bg,extent,meta_bg,flex_bg,ext_attr".
With the patch applied, I ran the test 100 times and the e2fsck
error did not show up again.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-05-28 18:20:59 -04:00
} else {
ext4_lock_group ( sb , group ) ;
2012-02-06 20:12:03 -05:00
}
2012-05-28 18:20:59 -04:00
2012-02-06 20:12:03 -05:00
ext4_free_inodes_set ( sb , gdp , ext4_free_inodes_count ( sb , gdp ) - 1 ) ;
if ( S_ISDIR ( mode ) ) {
ext4_used_dirs_set ( sb , gdp , ext4_used_dirs_count ( sb , gdp ) + 1 ) ;
if ( sbi - > s_log_groups_per_flex ) {
ext4_group_t f = ext4_flex_group ( sbi , group ) ;
atomic_inc ( & sbi - > s_flex_groups [ f ] . used_dirs ) ;
}
}
2012-04-29 18:33:10 -04:00
if ( ext4_has_group_desc_csum ( sb ) ) {
ext4_inode_bitmap_csum_set ( sb , group , gdp , inode_bitmap_bh ,
EXT4_INODES_PER_GROUP ( sb ) / 8 ) ;
2012-04-29 18:45:10 -04:00
ext4_group_desc_csum_set ( sb , group , gdp ) ;
2012-02-06 20:12:03 -05:00
}
2012-05-28 18:20:59 -04:00
ext4_unlock_group ( sb , group ) ;
2012-02-06 20:12:03 -05:00
2009-01-03 22:33:39 -05:00
BUFFER_TRACE ( group_desc_bh , " call ext4_handle_dirty_metadata " ) ;
err = ext4_handle_dirty_metadata ( handle , NULL , group_desc_bh ) ;
2013-04-19 13:38:14 -04:00
if ( err ) {
ext4_std_error ( sb , err ) ;
goto out ;
}
2006-10-11 01:20:50 -07:00
percpu_counter_dec ( & sbi - > s_freeinodes_counter ) ;
if ( S_ISDIR ( mode ) )
percpu_counter_inc ( & sbi - > s_dirs_counter ) ;
2008-07-11 19:27:31 -04:00
if ( sbi - > s_log_groups_per_flex ) {
flex_group = ext4_flex_group ( sbi , group ) ;
2009-03-04 19:09:10 -05:00
atomic_dec ( & sbi - > s_flex_groups [ flex_group ] . free_inodes ) ;
2008-07-11 19:27:31 -04:00
}
2006-10-11 01:20:50 -07:00
2007-10-16 18:38:25 -04:00
inode - > i_ino = ino + group * EXT4_INODES_PER_GROUP ( sb ) ;
2006-10-11 01:20:50 -07:00
/* This is the optimal IO size (for stat), not the fs block size */
inode - > i_blocks = 0 ;
2007-07-18 09:15:20 -04:00
inode - > i_mtime = inode - > i_atime = inode - > i_ctime = ei - > i_crtime =
2016-11-14 21:40:10 -05:00
current_time ( inode ) ;
2006-10-11 01:20:50 -07:00
memset ( ei - > i_data , 0 , sizeof ( ei - > i_data ) ) ;
ei - > i_dir_start_lookup = 0 ;
ei - > i_disksize = 0 ;
2011-10-31 18:21:29 -04:00
/* Don't inherit extent flag from directory, amongst others. */
2009-02-15 18:09:20 -05:00
ei - > i_flags =
ext4_mask_flags ( mode , EXT4_I ( dir ) - > i_flags & EXT4_FL_INHERITED ) ;
2017-06-21 21:21:39 -04:00
ei - > i_flags | = i_flags ;
2006-10-11 01:20:50 -07:00
ei - > i_file_acl = 0 ;
ei - > i_dtime = 0 ;
ei - > i_block_group = group ;
2009-03-12 12:18:34 -04:00
ei - > i_last_alloc_group = ~ 0 ;
2006-10-11 01:20:50 -07:00
2006-10-11 01:20:53 -07:00
ext4_set_inode_flags ( inode ) ;
2006-10-11 01:20:50 -07:00
if ( IS_DIRSYNC ( inode ) )
2009-01-07 00:06:22 -05:00
ext4_handle_sync ( handle ) ;
2008-12-30 02:03:31 -05:00
if ( insert_inode_locked ( inode ) < 0 ) {
2011-12-18 17:37:02 -05:00
/*
* Likely a bitmap corruption causing inode to be allocated
* twice .
*/
err = - EIO ;
2013-04-19 13:38:14 -04:00
ext4_error ( sb , " failed to insert inode %lu: doubly allocated? " ,
inode - > i_ino ) ;
goto out ;
2008-12-30 02:03:31 -05:00
}
2006-10-11 01:20:50 -07:00
spin_lock ( & sbi - > s_next_gen_lock ) ;
inode - > i_generation = sbi - > s_next_generation + + ;
spin_unlock ( & sbi - > s_next_gen_lock ) ;
2012-04-29 18:31:10 -04:00
/* Precompute checksum seed for inode metadata */
2014-10-13 03:36:16 -04:00
if ( ext4_has_metadata_csum ( sb ) ) {
2012-04-29 18:31:10 -04:00
__u32 csum ;
__le32 inum = cpu_to_le32 ( inode - > i_ino ) ;
__le32 gen = cpu_to_le32 ( inode - > i_generation ) ;
csum = ext4_chksum ( sbi , sbi - > s_csum_seed , ( __u8 * ) & inum ,
sizeof ( inum ) ) ;
ei - > i_csum_seed = ext4_chksum ( sbi , csum , ( __u8 * ) & gen ,
sizeof ( gen ) ) ;
}
2011-01-10 12:18:25 -05:00
ext4_clear_state_flags ( ei ) ; /* Only relevant on 32-bit archs */
2010-01-24 14:34:07 -05:00
ext4_set_inode_state ( inode , EXT4_STATE_NEW ) ;
2007-07-18 09:15:20 -04:00
ei - > i_extra_isize = EXT4_SB ( sb ) - > s_want_extra_isize ;
2012-12-10 14:06:03 -05:00
ei - > i_inline_off = 0 ;
2015-10-17 16:18:43 -04:00
if ( ext4_has_feature_inline_data ( sb ) )
2012-12-10 14:06:03 -05:00
ext4_set_inode_state ( inode , EXT4_STATE_MAY_INLINE_DATA ) ;
2006-10-11 01:20:50 -07:00
ret = inode ;
2010-03-03 09:05:01 -05:00
err = dquot_alloc_inode ( inode ) ;
if ( err )
2006-10-11 01:20:50 -07:00
goto fail_drop ;
ext4: inherit encryption xattr before other xattrs
When using both encryption and SELinux (or another feature that requires
an xattr per file) on a filesystem with 256-byte inodes, each file's
xattrs usually spill into an external xattr block. Currently, the
xattrs are inherited in the order ACL, security, then encryption.
Therefore, if spillage occurs, the encryption xattr will always end up
in the external block. This is not ideal because the encryption xattrs
contain a nonce, so they will always be unique and will prevent the
external xattr blocks from being deduplicated.
To improve the situation, change the inheritance order to encryption,
ACL, then security. This gives the encryption xattr a better chance to
be stored in-inode, allowing the other xattr(s) to be deduplicated.
Note that it may be better for userspace to format the filesystem with
512-byte inodes in this case. However, it's not the default.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-02 00:49:54 -04:00
/*
* Since the encryption xattr will always be unique , create it first so
* that it ' s less likely to end up in an external xattr block and
* prevent its deduplication .
*/
if ( encrypt ) {
err = fscrypt_inherit_context ( dir , inode , handle , true ) ;
if ( err )
goto fail_free_drop ;
}
2017-06-21 21:21:39 -04:00
if ( ! ( ei - > i_flags & EXT4_EA_INODE_FL ) ) {
err = ext4_init_acl ( handle , inode , dir ) ;
if ( err )
goto fail_free_drop ;
2006-10-11 01:20:50 -07:00
2017-07-06 00:00:59 -04:00
err = ext4_init_security ( handle , inode , dir , qstr ) ;
if ( err )
goto fail_free_drop ;
}
2006-10-11 01:20:50 -07:00
2015-10-17 16:18:43 -04:00
if ( ext4_has_feature_extents ( sb ) ) {
2008-07-11 19:27:31 -04:00
/* set extent flag only for directory, file and normal symlink*/
2008-04-29 08:11:12 -04:00
if ( S_ISDIR ( mode ) | | S_ISREG ( mode ) | | S_ISLNK ( mode ) ) {
2010-05-16 22:00:00 -04:00
ext4_set_inode_flag ( inode , EXT4_INODE_EXTENTS ) ;
2008-02-25 16:38:03 -05:00
ext4_ext_tree_init ( handle , inode ) ;
}
2006-10-11 01:21:03 -07:00
}
2006-10-11 01:20:50 -07:00
2011-03-16 17:16:31 -04:00
if ( ext4_handle_valid ( handle ) ) {
ei - > i_sync_tid = handle - > h_transaction - > t_tid ;
ei - > i_datasync_tid = handle - > h_transaction - > t_tid ;
}
2008-04-29 22:00:36 -04:00
err = ext4_mark_inode_dirty ( handle , inode ) ;
if ( err ) {
ext4_std_error ( sb , err ) ;
goto fail_free_drop ;
}
2006-10-11 01:20:53 -07:00
ext4_debug ( " allocating inode %lu \n " , inode - > i_ino ) ;
2009-06-17 11:48:11 -04:00
trace_ext4_allocate_inode ( inode , dir , mode ) ;
2009-01-03 22:33:39 -05:00
brelse ( inode_bitmap_bh ) ;
2006-10-11 01:20:50 -07:00
return ret ;
fail_free_drop :
2010-03-03 09:05:01 -05:00
dquot_free_inode ( inode ) ;
2006-10-11 01:20:50 -07:00
fail_drop :
2011-10-28 14:13:28 +02:00
clear_nlink ( inode ) ;
2008-12-30 02:03:31 -05:00
unlock_new_inode ( inode ) ;
2013-04-19 13:38:14 -04:00
out :
dquot_drop ( inode ) ;
inode - > i_flags | = S_NOQUOTA ;
2006-10-11 01:20:50 -07:00
iput ( inode ) ;
2009-01-03 22:33:39 -05:00
brelse ( inode_bitmap_bh ) ;
2006-10-11 01:20:50 -07:00
return ERR_PTR ( err ) ;
}
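The bg_itable_unused update in the function above keeps a per-group high-water mark of used inodes. A minimal userspace sketch of that bookkeeping follows, assuming `ino` is the 1-based position of the new inode within its group. The struct, the flag macro, and the function name are hypothetical stand-ins for illustration, not kernel definitions, and locking is omitted.

```c
#include <stdint.h>

/* Stand-in for EXT4_BG_INODE_UNINIT; named differently to mark it as a sketch. */
#define SKETCH_BG_INODE_UNINIT 0x0001

/* Hypothetical, simplified group descriptor holding only the fields used here. */
struct sketch_group_desc {
	uint16_t flags;		/* group flags, as in bg_flags */
	uint32_t itable_unused;	/* unused inodes at the end of the inode table */
};

/*
 * Record that the inode at 1-based position `ino` in this group was just
 * allocated. If the group was still marked uninitialized, nothing has been
 * used yet, so the high-water mark starts at zero. The unused count only
 * shrinks when the new inode lies beyond the current high-water mark.
 */
static void note_inode_allocated(struct sketch_group_desc *gd,
				 uint32_t inodes_per_group, uint32_t ino)
{
	uint32_t used = inodes_per_group - gd->itable_unused;

	if (gd->flags & SKETCH_BG_INODE_UNINIT) {
		gd->flags &= (uint16_t)~SKETCH_BG_INODE_UNINIT;
		used = 0;
	}
	if (ino > used)
		gd->itable_unused = inodes_per_group - ino;
}
```

Allocating an inode below the high-water mark leaves the count untouched, which is why a checker can skip the tail of the inode table without scanning it.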
/* Verify that we are loading a valid orphan from disk */
2006-10-11 01:20:53 -07:00
struct inode * ext4_orphan_get ( struct super_block * sb , unsigned long ino )
2006-10-11 01:20:50 -07:00
{
2006-10-11 01:20:53 -07:00
unsigned long max_ino = le32_to_cpu ( EXT4_SB ( sb ) - > s_es - > s_inodes_count ) ;
2008-01-28 23:58:27 -05:00
ext4_group_t block_group ;
2006-10-11 01:20:50 -07:00
int bit ;
2016-04-30 00:49:54 -04:00
struct buffer_head * bitmap_bh = NULL ;
2006-10-11 01:20:50 -07:00
struct inode * inode = NULL ;
2016-04-30 00:49:54 -04:00
int err = - EFSCORRUPTED ;
2006-10-11 01:20:50 -07:00
2016-04-30 00:49:54 -04:00
if ( ino < EXT4_FIRST_INO ( sb ) | | ino > max_ino )
goto bad_orphan ;
2006-10-11 01:20:50 -07:00
2006-10-11 01:20:53 -07:00
block_group = ( ino - 1 ) / EXT4_INODES_PER_GROUP ( sb ) ;
bit = ( ino - 1 ) % EXT4_INODES_PER_GROUP ( sb ) ;
2008-08-02 21:21:02 -04:00
bitmap_bh = ext4_read_inode_bitmap ( sb , block_group ) ;
2015-10-17 21:33:24 -04:00
if ( IS_ERR ( bitmap_bh ) ) {
2016-04-30 00:49:54 -04:00
ext4_error ( sb , " inode bitmap error %ld for orphan %lu " ,
PTR_ERR ( bitmap_bh ) , ino ) ;
return ( struct inode * ) bitmap_bh ;
2006-10-11 01:20:50 -07:00
}
/* Having the inode bit set should be a 100% indicator that this
* is a valid orphan ( no e2fsck run on fs ) . Orphans also include
* inodes that were being truncated , so we can ' t check i_nlink = = 0.
*/
2008-02-07 00:15:37 -08:00
if ( ! ext4_test_bit ( bit , bitmap_bh - > b_data ) )
goto bad_orphan ;
inode = ext4_iget ( sb , ino ) ;
2016-04-30 00:49:54 -04:00
if ( IS_ERR ( inode ) ) {
err = PTR_ERR ( inode ) ;
ext4_error ( sb , " couldn't read orphan inode %lu (err %d) " ,
ino , err ) ;
return inode ;
}
2008-02-07 00:15:37 -08:00
2008-07-11 19:27:31 -04:00
/*
2016-04-30 00:48:54 -04:00
* If the orphan has i_nlink > 0 then it should be able to
* be truncated , otherwise it won ' t be removed from the orphan
* list during processing and an infinite loop will result .
* Similarly , it must not be a bad inode .
2008-07-11 19:27:31 -04:00
*/
2016-04-30 00:48:54 -04:00
if ( ( inode - > i_nlink & & ! ext4_can_truncate ( inode ) ) | |
is_bad_inode ( inode ) )
2008-07-11 19:27:31 -04:00
goto bad_orphan ;
2008-02-07 00:15:37 -08:00
if ( NEXT_ORPHAN ( inode ) > max_ino )
goto bad_orphan ;
brelse ( bitmap_bh ) ;
return inode ;
bad_orphan :
2016-04-30 00:49:54 -04:00
ext4_error ( sb , " bad orphan inode %lu " , ino ) ;
if ( bitmap_bh )
printk ( KERN_ERR " ext4_test_bit(bit=%d, block=%llu) = %d \n " ,
bit , ( unsigned long long ) bitmap_bh - > b_blocknr ,
ext4_test_bit ( bit , bitmap_bh - > b_data ) ) ;
2008-02-07 00:15:37 -08:00
if ( inode ) {
2016-04-30 00:49:54 -04:00
printk ( KERN_ERR " is_bad_inode(inode)=%d \n " ,
2008-02-07 00:15:37 -08:00
is_bad_inode ( inode ) ) ;
2016-04-30 00:49:54 -04:00
printk ( KERN_ERR " NEXT_ORPHAN(inode)=%u \n " ,
2008-02-07 00:15:37 -08:00
NEXT_ORPHAN ( inode ) ) ;
2016-04-30 00:49:54 -04:00
printk ( KERN_ERR " max_ino=%lu \n " , max_ino ) ;
printk ( KERN_ERR " i_nlink=%u \n " , inode - > i_nlink ) ;
2006-10-11 01:20:50 -07:00
/* Avoid freeing blocks if we got a bad deleted inode */
2008-02-07 00:15:37 -08:00
if ( inode - > i_nlink = = 0 )
2006-10-11 01:20:50 -07:00
inode - > i_blocks = 0 ;
iput ( inode ) ;
}
brelse ( bitmap_bh ) ;
2008-02-07 00:15:37 -08:00
return ERR_PTR ( err ) ;
2006-10-11 01:20:50 -07:00
}
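ext4_orphan_get() above turns a 1-based inode number into a block group and a bit index in that group's inode bitmap, after rejecting numbers outside the valid range. The arithmetic can be sketched in isolation as below. The struct and function names are hypothetical, and the range limits are passed in as plain parameters rather than read from a superblock.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type for this sketch; not a kernel structure. */
struct ino_location {
	uint32_t group;	/* block group that owns the inode */
	uint32_t bit;	/* zero-based bit index in that group's inode bitmap */
};

/*
 * Map a 1-based inode number to its group and bitmap bit. Numbers outside
 * [first_ino, max_ino] are rejected, mirroring the range check performed
 * before the bitmap is read. Inode 0 never exists, so it is rejected too.
 */
static bool locate_inode(uint32_t ino, uint32_t first_ino, uint32_t max_ino,
			 uint32_t inodes_per_group, struct ino_location *loc)
{
	if (inodes_per_group == 0 || ino == 0)
		return false;
	if (ino < first_ino || ino > max_ino)
		return false;

	loc->group = (ino - 1) / inodes_per_group;
	loc->bit = (ino - 1) % inodes_per_group;
	return true;
}
```

With 8192 inodes per group, inode 8192 is the last bit of group 0 and inode 8193 is bit 0 of group 1, which is why the code subtracts one before dividing.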
2008-09-08 22:25:24 -04:00
unsigned long ext4_count_free_inodes(struct super_block *sb)
2006-10-11 01:20:50 -07:00
{
	unsigned long desc_count;
2006-10-11 01:20:53 -07:00
	struct ext4_group_desc *gdp;
2009-05-01 08:50:38 -04:00
	ext4_group_t i, ngroups = ext4_get_groups_count(sb);
2006-10-11 01:20:53 -07:00
#ifdef EXT4FS_DEBUG
	struct ext4_super_block *es;
2006-10-11 01:20:50 -07:00
	unsigned long bitmap_count, x;
	struct buffer_head *bitmap_bh = NULL;
2006-10-11 01:20:53 -07:00
	es = EXT4_SB(sb)->s_es;
2006-10-11 01:20:50 -07:00
	desc_count = 0;
	bitmap_count = 0;
	gdp = NULL;
2009-05-01 08:50:38 -04:00
	for (i = 0; i < ngroups; i++) {
2008-09-08 22:25:24 -04:00
		gdp = ext4_get_group_desc(sb, i, NULL);
2006-10-11 01:20:50 -07:00
		if (!gdp)
			continue;
2009-01-05 22:20:24 -05:00
		desc_count += ext4_free_inodes_count(sb, gdp);
2006-10-11 01:20:50 -07:00
		brelse(bitmap_bh);
2008-08-02 21:21:02 -04:00
		bitmap_bh = ext4_read_inode_bitmap(sb, i);
2015-10-17 21:33:24 -04:00
		if (IS_ERR(bitmap_bh)) {
			bitmap_bh = NULL;
2006-10-11 01:20:50 -07:00
			continue;
2015-10-17 21:33:24 -04:00
		}
2006-10-11 01:20:50 -07:00
2012-06-30 19:14:57 -04:00
		x = ext4_count_free(bitmap_bh->b_data,
				    EXT4_INODES_PER_GROUP(sb) / 8);
2008-01-28 23:58:27 -05:00
		printk(KERN_DEBUG "group %lu: stored = %d, counted = %lu\n",
2009-07-27 21:44:40 -04:00
			(unsigned long)i, ext4_free_inodes_count(sb, gdp), x);
2006-10-11 01:20:50 -07:00
		bitmap_count += x;
	}
	brelse(bitmap_bh);
2008-09-08 23:00:52 -04:00
	printk(KERN_DEBUG "ext4_count_free_inodes: "
	       "stored = %u, computed = %lu, %lu\n",
	       le32_to_cpu(es->s_free_inodes_count), desc_count, bitmap_count);
2006-10-11 01:20:50 -07:00
	return desc_count;
#else
	desc_count = 0;
2009-05-01 08:50:38 -04:00
	for (i = 0; i < ngroups; i++) {
2008-09-08 22:25:24 -04:00
		gdp = ext4_get_group_desc(sb, i, NULL);
2006-10-11 01:20:50 -07:00
		if (!gdp)
			continue;
2009-01-05 22:20:24 -05:00
		desc_count += ext4_free_inodes_count(sb, gdp);
2006-10-11 01:20:50 -07:00
		cond_resched();
	}
	return desc_count;
#endif
}
/* Called at mount-time, super-block is locked */
2008-09-08 22:25:24 -04:00
unsigned long ext4_count_dirs(struct super_block *sb)
2006-10-11 01:20:50 -07:00
{
	unsigned long count = 0;
2009-05-01 08:50:38 -04:00
	ext4_group_t i, ngroups = ext4_get_groups_count(sb);
2006-10-11 01:20:50 -07:00
2009-05-01 08:50:38 -04:00
	for (i = 0; i < ngroups; i++) {
2008-09-08 22:25:24 -04:00
		struct ext4_group_desc *gdp = ext4_get_group_desc(sb, i, NULL);
2006-10-11 01:20:50 -07:00
		if (!gdp)
			continue;
2009-01-05 22:20:24 -05:00
		count += ext4_used_dirs_count(sb, gdp);
2006-10-11 01:20:50 -07:00
	}
	return count;
}
ext4: add support for lazy inode table initialization
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out. The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, have not been corrupted. However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the old, uninitialized data could potentially cause e2fsck to
report false problems.
Hence, it is important for the inode tables to be initialized as soon
as possible. This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.
This is done via a new kernel thread called ext4lazyinit, which is
created on demand and destroyed when it is no longer needed. There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with the init_itable mount option is mounted, the
ext4lazyinit thread is created, and the filesystem can then register
its request in the request list.
This thread then walks through the list of requests, picking up
scheduled requests and invoking ext4_init_inode_table(). The next
schedule time for a request is computed by multiplying the time it took
to zero out the last inode table by the wait multiplier, which can be
set with the (init_itable=n) mount option (default is 10). We do this
so that we do not take the whole I/O bandwidth. When the thread is no
longer necessary (the request list is empty) it frees the appropriate
structures and exits (and can be created later by another
filesystem).
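The scheduling rule described above can be sketched in a few lines of
userspace C. This is only an illustration of the arithmetic: the helper
name next_sched_time() and its parameters are hypothetical, the time
unit is an arbitrary tick, and the overflow check is an addition of the
sketch rather than something this message describes.

```c
#include <assert.h>
#include <limits.h>

/*
 * Hypothetical helper (not a kernel function): given the current time and
 * how long the last inode table zeroout took, both in the same tick unit,
 * return the time at which the request should next be run.
 */
static unsigned long next_sched_time(unsigned long now,
				     unsigned long zeroout_ticks,
				     unsigned int wait_mult)
{
	/* Sketch-only guard: refuse inputs whose product would overflow. */
	assert(wait_mult == 0 || zeroout_ticks <= ULONG_MAX / wait_mult);

	/* Wait wait_mult times as long as the last zeroout took. */
	return now + zeroout_ticks * wait_mult;
}
```

With the default multiplier of 10, a zeroout that took 25 ticks delays
the next request by 250 ticks, so under this rule the thread spends
roughly one part in eleven of its time issuing zeroout I/O.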
We do not disturb regular inode allocations in any way; they simply do
not care whether the inode table is zeroed or not. But when zeroing, we
have to skip used inodes, obviously. We should also prevent new inode
allocations from the group while zeroing is under way. For that we
take alloc_sem for writing in ext4_init_inode_table() and for reading
in ext4_claim_inode, so when we are unlucky and the allocator hits the
group which is currently being zeroed, it just has to wait.
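As a rough illustration of skipping used inodes, the following
userspace sketch mirrors the block arithmetic that
ext4_init_inode_table() performs when the inode bitmap has already been
initialized. The struct zero_range type and the itable_zero_range()
helper are hypothetical names introduced here, and plain integers stand
in for the superblock and group descriptor fields.

```c
#include <assert.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Hypothetical result type for this sketch. */
struct zero_range {
	unsigned int skip_blks;	/* leading itable blocks holding used inodes */
	unsigned int zero_blks;	/* remaining itable blocks to zero out */
};

/*
 * Hypothetical helper: work out how many inode table blocks to skip and
 * how many to zero. Returns 0 on success, -1 if the counts are inconsistent.
 */
static int itable_zero_range(unsigned int inodes_per_group,
			     unsigned int itable_unused,
			     unsigned int inodes_per_block,
			     unsigned int itb_per_group,
			     struct zero_range *out)
{
	unsigned int used_blks;

	assert(inodes_per_block > 0);

	/* More unused inodes than inodes in the group means corruption. */
	if (itable_unused > inodes_per_group)
		return -1;

	/* Round up so a partially used block is never zeroed. */
	used_blks = DIV_ROUND_UP(inodes_per_group - itable_unused,
				 inodes_per_block);
	if (used_blks > itb_per_group)
		return -1;

	out->skip_blks = used_blks;
	out->zero_blks = itb_per_group - used_blks;
	return 0;
}
```

Rounding up is the conservative choice: a block that holds even one
used inode is left alone. A full table gives zero_blks == 0, which is
the case where the zeroout is skipped but the group is still marked as
zeroed.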
This can be suppressed using the mount option no_init_itable.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:05 -04:00
/*
 * Zeroes not yet zeroed inode table - just write zeroes through the whole
 * inode table. Must be called without any spinlock held. The only place
 * where it is called from on active part of filesystem is ext4lazyinit
 * thread, so we do not need any special locks, however we have to prevent
 * inode allocation from the current group, so we take alloc_sem lock, to
2012-02-06 20:12:03 -05:00
 * block ext4_new_inode() until we are finished.
2010-10-27 21:30:05 -04:00
 */
2011-10-18 10:57:51 -04:00
int ext4_init_inode_table(struct super_block *sb, ext4_group_t group,
2010-10-27 21:30:05 -04:00
			  int barrier)
{
	struct ext4_group_info *grp = ext4_get_group_info(sb, group);
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	struct ext4_group_desc *gdp = NULL;
	struct buffer_head *group_desc_bh;
	handle_t *handle;
	ext4_fsblk_t blk;
	int num, ret = 0, used_blks = 0;
	/* This should not happen, but just to be sure check this */
	if (sb->s_flags & MS_RDONLY) {
		ret = 1;
		goto out;
	}
	gdp = ext4_get_group_desc(sb, group, &group_desc_bh);
	if (!gdp)
		goto out;
	/*
	 * We do not need to lock this, because we are the only one
	 * handling this flag.
	 */
	if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED))
		goto out;
2013-02-08 21:59:22 -05:00
	handle = ext4_journal_start_sb(sb, EXT4_HT_MISC, 1);
2010-10-27 21:30:05 -04:00
	if (IS_ERR(handle)) {
		ret = PTR_ERR(handle);
		goto out;
	}
	down_write(&grp->alloc_sem);
	/*
	 * If inode bitmap was already initialized there may be some
	 * used inodes so we need to skip blocks with used inodes in
	 * inode table.
	 */
	if (!(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)))
		used_blks = DIV_ROUND_UP((EXT4_INODES_PER_GROUP(sb) -
			    ext4_itable_unused_count(sb, gdp)),
			    sbi->s_inodes_per_block);
2010-10-27 21:30:05 -04:00
	if ((used_blks < 0) || (used_blks > sbi->s_itb_per_group)) {
2012-03-19 23:13:43 -04:00
		ext4_error(sb, "Something is wrong with group %u: "
			   "used itable blocks: %d; "
			   "itable unused count: %u",
2010-10-27 21:30:05 -04:00
			   group, used_blks,
			   ext4_itable_unused_count(sb, gdp));
		ret = 1;
2011-08-01 06:32:19 -04:00
		goto err_out;
2010-10-27 21:30:05 -04:00
	}
2010-10-27 21:30:05 -04:00
	blk = ext4_inode_table(sb, gdp) + used_blks;
	num = sbi->s_itb_per_group - used_blks;
	BUFFER_TRACE(group_desc_bh, "get_write_access");
	ret = ext4_journal_get_write_access(handle,
					    group_desc_bh);
	if (ret)
		goto err_out;
	/*
	 * Skip zeroout if the inode table is full. But we set the ZEROED
	 * flag anyway, because obviously, when it is full it does not need
	 * further zeroing.
	 */
	if (unlikely(num == 0))
		goto skip_zeroout;
	ext4_debug("going to zero out inode table in group %d\n",
		   group);
2010-10-27 23:44:47 -04:00
	ret = sb_issue_zeroout(sb, blk, num, GFP_NOFS);
2010-10-27 21:30:05 -04:00
	if (ret < 0)
		goto err_out;
2010-10-27 23:44:47 -04:00
	if (barrier)
		blkdev_issue_flush(sb->s_bdev, GFP_NOFS, NULL);
2010-10-27 21:30:05 -04:00
skip_zeroout:
	ext4_lock_group(sb, group);
	gdp->bg_flags |= cpu_to_le16(EXT4_BG_INODE_ZEROED);
2012-04-29 18:45:10 -04:00
	ext4_group_desc_csum_set(sb, group, gdp);
2010-10-27 21:30:05 -04:00
	ext4_unlock_group(sb, group);
	BUFFER_TRACE(group_desc_bh,
		     "call ext4_handle_dirty_metadata");
	ret = ext4_handle_dirty_metadata(handle, NULL,
					 group_desc_bh);
err_out:
	up_write(&grp->alloc_sem);
	ext4_journal_stop(handle);
out:
	return ret;
}