.. SPDX-License-Identifier: GPL-2.0

Layout
------

The layout of a standard block group is approximately as follows (each
of these fields is discussed in a separate section below):

.. list-table::
   :widths: 1 1 1 1 1 1 1 1
   :header-rows: 1

   * - Group 0 Padding
     - ext4 Super Block
     - Group Descriptors
     - Reserved GDT Blocks
     - Data Block Bitmap
     - inode Bitmap
     - inode Table
     - Data Blocks
   * - 1024 bytes
     - 1 block
     - many blocks
     - many blocks
     - 1 block
     - 1 block
     - many blocks
     - many more blocks

For the special case of block group 0, the first 1024 bytes are unused,
to allow for the installation of x86 boot sectors and other oddities.
The superblock will start at offset 1024 bytes, whichever block that
happens to be (usually 0). However, if the block size is 1024 bytes,
then block 0 is marked in use and the superblock goes in block 1.
For all other block groups, there is no padding.

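The rule above can be checked with a little arithmetic. The following is
a minimal sketch, not kernel code; the helper name
``superblock_location`` is hypothetical and exists only for
illustration.

```python
# Bytes of group 0 padding that precede the primary superblock.
SUPERBLOCK_OFFSET = 1024


def superblock_location(block_size):
    """Return (block number, byte offset inside that block) of the
    primary superblock for a given block size in bytes."""
    return divmod(SUPERBLOCK_OFFSET, block_size)


for block_size in (1024, 2048, 4096):
    block, offset = superblock_location(block_size)
    print(f"block size {block_size}: block {block}, offset {offset}")
```

With a 1024-byte block size the padding fills block 0 completely, so the
superblock lands at the start of block 1. With larger blocks it sits
1024 bytes into block 0.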

The ext4 driver primarily works with the superblock and the group
descriptors that are found in block group 0. Redundant copies of the
superblock and group descriptors are written to some of the block groups
across the disk in case the beginning of the disk gets trashed, though
not all block groups necessarily host a redundant copy (see following
paragraph for more details). If the group does not have a redundant
copy, the block group begins with the data block bitmap. Note also that
when the filesystem is freshly formatted, mkfs will allocate “reserved
GDT block” space after the block group descriptors and before the start
of the block bitmaps to allow for future expansion of the filesystem. By
default, a filesystem is allowed to increase in size by a factor of
1024x over the original filesystem size.

The location of the inode table is given by ``grp.bg_inode_table_*``. It
is a contiguous range of blocks large enough to contain
``sb.s_inodes_per_group * sb.s_inode_size`` bytes.

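As a rough illustration of the size formula above, the number of blocks
occupied by one group's inode table can be computed as follows. This is
a sketch only: the function name and the sample values (8192 inodes per
group, 256-byte inodes, 4096-byte blocks) are assumptions chosen for the
example, not values read from a real filesystem.

```python
def inode_table_blocks(inodes_per_group, inode_size, block_size):
    """Blocks needed to hold sb.s_inodes_per_group * sb.s_inode_size
    bytes, rounded up to a whole number of blocks."""
    table_bytes = inodes_per_group * inode_size
    return -(-table_bytes // block_size)  # ceiling division


# Illustrative values only (assumed, not taken from a real superblock).
print(inode_table_blocks(8192, 256, 4096))
```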

As for the ordering of items in a block group, it is generally
established that the super block and the group descriptor table, if
present, will be at the beginning of the block group. The bitmaps and
the inode table can be anywhere, and it is quite possible for the
bitmaps to come after the inode table, or for both to be in different
groups (flex_bg). Leftover space is used for file data blocks, indirect
block maps, extent tree blocks, and extended attributes.

Flexible Block Groups
---------------------

Starting in ext4, there is a new feature called flexible block groups
(flex_bg). In a flex_bg, several block groups are tied together as one
logical block group; the bitmap spaces and the inode table space in the
first block group of the flex_bg are expanded to include the bitmaps
and inode tables of all other block groups in the flex_bg. For example,
if the flex_bg size is 4, then group 0 will contain (in order) the
superblock, group descriptors, data block bitmaps for groups 0-3, inode
bitmaps for groups 0-3, inode tables for groups 0-3, and the remaining
space in group 0 is for file data. The effect of this is to group the
block group metadata close together for faster loading, and to enable
large files to be contiguous on disk. Backup copies of the superblock
and group descriptors are always at the beginning of block groups, even
if flex_bg is enabled. The number of block groups that make up a
flex_bg is given by 2 ^ ``sb.s_log_groups_per_flex``.

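The grouping described above can be sketched in a few lines. The helper
names are hypothetical, and the sketch assumes that flex_bgs are aligned
on multiples of the flex_bg size and that the last flex_bg is not cut
short by the end of the filesystem.

```python
def groups_per_flex(s_log_groups_per_flex):
    """Number of block groups in one flex_bg: 2 ^ s_log_groups_per_flex."""
    return 1 << s_log_groups_per_flex


def flex_group_members(group, s_log_groups_per_flex):
    """Return (first group of the flex_bg containing `group`,
    list of all block groups in that flex_bg)."""
    size = groups_per_flex(s_log_groups_per_flex)
    first = (group // size) * size
    return first, list(range(first, first + size))


# With s_log_groups_per_flex = 2 the flex_bg size is 4, so block group 5
# belongs to the flex_bg whose metadata is packed into block group 4.
print(flex_group_members(5, 2))
```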

Meta Block Groups
-----------------

Without the META_BG option, for safety reasons, all block group
descriptor copies are kept in the first block group. Given the default
128MiB (2^27 bytes) block group size and 64-byte group descriptors, ext4
can have at most 2^27/64 = 2^21 block groups. This limits the entire
filesystem size to 2^21 * 2^27 = 2^48 bytes, or 256TiB.

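The limit above is plain arithmetic and can be reproduced directly. The
constant names below are made up for this sketch; the values are the
ones quoted in the text.

```python
BLOCK_GROUP_BYTES = 1 << 27  # default 128MiB block group
DESC_SIZE = 64               # bytes per group descriptor

# All descriptors must fit inside the first block group.
max_groups = BLOCK_GROUP_BYTES // DESC_SIZE
max_fs_bytes = max_groups * BLOCK_GROUP_BYTES

print(max_groups == 1 << 21)
print(max_fs_bytes // (1 << 40), "TiB")
```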

The solution to this problem is to use the metablock group feature
(META_BG), which is already in ext3 for all 2.6 releases. With the
META_BG feature, ext4 filesystems are partitioned into many metablock
groups. Each metablock group is a cluster of block groups whose group
descriptor structures can be stored in a single disk block. For ext4
filesystems with a 4KB block size and 64-byte group descriptors, a
single metablock group partition includes 64 block groups, or 8 GiB of
disk space. The metablock group feature moves the location of the group
descriptors from the congested first block group of the whole filesystem
into the first group of each metablock group itself. The backups are in
the second and last group of each metablock group. This increases the
2^21 maximum block groups limit to the hard limit 2^32, allowing support
for a 512PiB filesystem.

The change in the filesystem format replaces the current scheme where
the superblock is followed by a variable-length set of block group
descriptors. Instead, the superblock and a single block group descriptor
block are placed at the beginning of the first, second, and last block
groups in a metablock group. A metablock group is a collection of
block groups which can be described by a single block group descriptor
block. When the block group descriptor structure is 32 bytes (that is,
without the 64bit feature), a metablock group contains 32 block groups
for filesystems with a 1KB block size, and 128 block groups for
filesystems with a 4KB block size. Filesystems can either be created
using this new block group descriptor layout, or existing filesystems
can be resized on-line, and the field ``sb.s_first_meta_bg`` in the
superblock will indicate the first block group using this new layout.

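The figures in the two paragraphs above follow from dividing the block
size by the descriptor size. The sketch below uses hypothetical helper
names and assumes every metablock group is complete, which may not hold
for the last one on a real filesystem.

```python
def groups_per_meta_bg(block_size, desc_size):
    """Block groups whose descriptors fit in a single block."""
    return block_size // desc_size


def meta_bg_descriptor_groups(meta_bg_index, block_size, desc_size):
    """Block groups holding the descriptor block of one metablock group:
    (primary in the first group, backup in the second, backup in the last)."""
    size = groups_per_meta_bg(block_size, desc_size)
    first = meta_bg_index * size
    return first, first + 1, first + size - 1


BLOCK_GROUP_BYTES = 1 << 27  # default 128MiB block group with 4KB blocks

print(groups_per_meta_bg(4096, 64))  # 64-byte descriptors, 4KB blocks
print(groups_per_meta_bg(1024, 32))  # 32-byte descriptors, 1KB blocks
print(groups_per_meta_bg(4096, 32))  # 32-byte descriptors, 4KB blocks
print(meta_bg_descriptor_groups(1, 4096, 64))
```

With 64 groups of 128MiB each, one metablock group spans 8 GiB, and 2^32
block groups of 2^27 bytes give the 512PiB ceiling quoted above.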

Please see an important note about ``BLOCK_UNINIT`` in the section about
block and inode bitmaps.

Lazy Block Group Initialization
-------------------------------

A new feature for ext4 is a set of three block group descriptor flags
that enable mkfs to skip initializing other parts of the block group
metadata. Specifically, the INODE_UNINIT and BLOCK_UNINIT flags mean
that the inode and block bitmaps for that group can be calculated and
therefore the on-disk bitmap blocks are not initialized. This is
generally the case for an empty block group or a block group containing
only fixed-location block group metadata. The INODE_ZEROED flag means
that the inode table has been initialized; mkfs will unset this flag and
rely on the kernel to initialize the inode tables in the background.

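A reader of the group descriptor might interpret these flags roughly as
sketched below. The numeric values are not given in this section; they
are assumed here to match the kernel's ``EXT4_BG_*`` definitions and
should be checked against the group descriptor documentation. The
function name ``describe_bg_flags`` is hypothetical.

```python
# Assumed bit values; verify against the group descriptor definition.
INODE_UNINIT = 0x1  # inode bitmap is not initialized on disk
BLOCK_UNINIT = 0x2  # block bitmap is not initialized on disk
INODE_ZEROED = 0x4  # inode table has been zeroed


def describe_bg_flags(bg_flags):
    """Summarize what the three lazy-initialization flags say about
    one block group's on-disk metadata."""
    return {
        "inode_bitmap_on_disk": not bg_flags & INODE_UNINIT,
        "block_bitmap_on_disk": not bg_flags & BLOCK_UNINIT,
        "inode_table_zeroed": bool(bg_flags & INODE_ZEROED),
    }


# A freshly formatted, empty group: both bitmaps must be computed and
# the inode table still awaits background zeroing.
print(describe_bg_flags(INODE_UNINIT | BLOCK_UNINIT))
```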

By not writing zeroes to the bitmaps and inode table, mkfs time is
reduced considerably. Note the feature flag is RO_COMPAT_GDT_CSUM,
but the dumpe2fs output prints this as “uninit_bg”. They are the same
thing.