3103084afc
Signed-off-by: Wang Jianjian <wangjianjian3@huawei.com> Link: https://lore.kernel.org/r/20220520022255.2120576-2-wangjianjian3@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
579 lines
17 KiB
ReStructuredText
579 lines
17 KiB
ReStructuredText
.. SPDX-License-Identifier: GPL-2.0
|
|
|
|
Index Nodes
|
|
-----------
|
|
|
|
In a regular UNIX filesystem, the inode stores all the metadata
|
|
pertaining to the file (time stamps, block maps, extended attributes,
|
|
etc), not the directory entry. To find the information associated with a
|
|
file, one must traverse the directory files to find the directory entry
|
|
associated with a file, then load the inode to find the metadata for
|
|
that file. ext4 appears to cheat (for performance reasons) a little bit
|
|
by storing a copy of the file type (normally stored in the inode) in the
|
|
directory entry. (Compare all this to FAT, which stores all the file
|
|
information directly in the directory entry, but does not support hard
|
|
links and is in general more seek-happy than ext4 due to its simpler
|
|
block allocator and extensive use of linked lists.)
|
|
|
|
The inode table is a linear array of ``struct ext4_inode``. The table is
|
|
sized to have enough blocks to store at least
|
|
``sb.s_inode_size * sb.s_inodes_per_group`` bytes. The number of the
|
|
block group containing an inode can be calculated as
|
|
``(inode_number - 1) / sb.s_inodes_per_group``, and the offset into the
|
|
group's table is ``(inode_number - 1) % sb.s_inodes_per_group``. There
|
|
is no inode 0.
|
|
|
|
The inode checksum is calculated against the FS UUID, the inode number,
|
|
and the inode structure itself.
|
|
|
|
The inode table entry is laid out in ``struct ext4_inode``.
|
|
|
|
.. list-table::
|
|
:widths: 8 8 24 40
|
|
:header-rows: 1
|
|
:class: longtable
|
|
|
|
* - Offset
|
|
- Size
|
|
- Name
|
|
- Description
|
|
* - 0x0
|
|
- __le16
|
|
- i_mode
|
|
- File mode. See the table i_mode_ below.
|
|
* - 0x2
|
|
- __le16
|
|
- i_uid
|
|
- Lower 16-bits of Owner UID.
|
|
* - 0x4
|
|
- __le32
|
|
- i_size_lo
|
|
- Lower 32-bits of size in bytes.
|
|
* - 0x8
|
|
- __le32
|
|
- i_atime
|
|
- Last access time, in seconds since the epoch. However, if the EA_INODE
|
|
inode flag is set, this inode stores an extended attribute value and
|
|
this field contains the checksum of the value.
|
|
* - 0xC
|
|
- __le32
|
|
- i_ctime
|
|
- Last inode change time, in seconds since the epoch. However, if the
|
|
EA_INODE inode flag is set, this inode stores an extended attribute
|
|
value and this field contains the lower 32 bits of the attribute value's
|
|
reference count.
|
|
* - 0x10
|
|
- __le32
|
|
- i_mtime
|
|
- Last data modification time, in seconds since the epoch. However, if the
|
|
EA_INODE inode flag is set, this inode stores an extended attribute
|
|
value and this field contains the number of the inode that owns the
|
|
extended attribute.
|
|
* - 0x14
|
|
- __le32
|
|
- i_dtime
|
|
- Deletion Time, in seconds since the epoch.
|
|
* - 0x18
|
|
- __le16
|
|
- i_gid
|
|
- Lower 16-bits of GID.
|
|
* - 0x1A
|
|
- __le16
|
|
- i_links_count
|
|
- Hard link count. Normally, ext4 does not permit an inode to have more
|
|
than 65,000 hard links. This applies to files as well as directories,
|
|
which means that there cannot be more than 64,998 subdirectories in a
|
|
directory (each subdirectory's '..' entry counts as a hard link, as does
|
|
the '.' entry in the directory itself). With the DIR_NLINK feature
|
|
enabled, ext4 supports more than 64,998 subdirectories by setting this
|
|
field to 1 to indicate that the number of hard links is not known.
|
|
* - 0x1C
|
|
- __le32
|
|
- i_blocks_lo
|
|
- Lower 32-bits of “block” count. If the huge_file feature flag is not
|
|
set on the filesystem, the file consumes ``i_blocks_lo`` 512-byte blocks
|
|
on disk. If huge_file is set and EXT4_HUGE_FILE_FL is NOT set in
|
|
``inode.i_flags``, then the file consumes ``i_blocks_lo + (i_blocks_hi
|
|
<< 32)`` 512-byte blocks on disk. If huge_file is set and
|
|
EXT4_HUGE_FILE_FL IS set in ``inode.i_flags``, then this file
|
|
consumes (``i_blocks_lo + i_blocks_hi`` << 32) filesystem blocks on
|
|
disk.
|
|
* - 0x20
|
|
- __le32
|
|
- i_flags
|
|
- Inode flags. See the table i_flags_ below.
|
|
* - 0x24
|
|
- 4 bytes
|
|
- i_osd1
|
|
- See the table i_osd1_ for more details.
|
|
* - 0x28
|
|
- 60 bytes
|
|
- i_block[EXT4_N_BLOCKS=15]
|
|
- Block map or extent tree. See the section “The Contents of inode.i_block”.
|
|
* - 0x64
|
|
- __le32
|
|
- i_generation
|
|
- File version (for NFS).
|
|
* - 0x68
|
|
- __le32
|
|
- i_file_acl_lo
|
|
- Lower 32-bits of extended attribute block. ACLs are of course one of
|
|
many possible extended attributes; I think the name of this field is a
|
|
result of the first use of extended attributes being for ACLs.
|
|
* - 0x6C
|
|
- __le32
|
|
- i_size_high / i_dir_acl
|
|
- Upper 32-bits of file/directory size. In ext2/3 this field was named
|
|
i_dir_acl, though it was usually set to zero and never used.
|
|
* - 0x70
|
|
- __le32
|
|
- i_obso_faddr
|
|
- (Obsolete) fragment address.
|
|
* - 0x74
|
|
- 12 bytes
|
|
- i_osd2
|
|
- See the table i_osd2_ for more details.
|
|
* - 0x80
|
|
- __le16
|
|
- i_extra_isize
|
|
- Size of this inode - 128. Alternately, the size of the extended inode
|
|
fields beyond the original ext2 inode, including this field.
|
|
* - 0x82
|
|
- __le16
|
|
- i_checksum_hi
|
|
- Upper 16-bits of the inode checksum.
|
|
* - 0x84
|
|
- __le32
|
|
- i_ctime_extra
|
|
- Extra change time bits. This provides sub-second precision. See Inode
|
|
Timestamps section.
|
|
* - 0x88
|
|
- __le32
|
|
- i_mtime_extra
|
|
- Extra modification time bits. This provides sub-second precision.
|
|
* - 0x8C
|
|
- __le32
|
|
- i_atime_extra
|
|
- Extra access time bits. This provides sub-second precision.
|
|
* - 0x90
|
|
- __le32
|
|
- i_crtime
|
|
- File creation time, in seconds since the epoch.
|
|
* - 0x94
|
|
- __le32
|
|
- i_crtime_extra
|
|
- Extra file creation time bits. This provides sub-second precision.
|
|
* - 0x98
|
|
- __le32
|
|
- i_version_hi
|
|
- Upper 32-bits for version number.
|
|
* - 0x9C
|
|
- __le32
|
|
- i_projid
|
|
- Project ID.
|
|
|
|
.. _i_mode:
|
|
|
|
The ``i_mode`` value is a combination of the following flags:
|
|
|
|
.. list-table::
|
|
:widths: 16 64
|
|
:header-rows: 1
|
|
|
|
* - Value
|
|
- Description
|
|
* - 0x1
|
|
- S_IXOTH (Others may execute)
|
|
* - 0x2
|
|
- S_IWOTH (Others may write)
|
|
* - 0x4
|
|
- S_IROTH (Others may read)
|
|
* - 0x8
|
|
- S_IXGRP (Group members may execute)
|
|
* - 0x10
|
|
- S_IWGRP (Group members may write)
|
|
* - 0x20
|
|
- S_IRGRP (Group members may read)
|
|
* - 0x40
|
|
- S_IXUSR (Owner may execute)
|
|
* - 0x80
|
|
- S_IWUSR (Owner may write)
|
|
* - 0x100
|
|
- S_IRUSR (Owner may read)
|
|
* - 0x200
|
|
- S_ISVTX (Sticky bit)
|
|
* - 0x400
|
|
- S_ISGID (Set GID)
|
|
* - 0x800
|
|
- S_ISUID (Set UID)
|
|
* -
|
|
- These are mutually-exclusive file types:
|
|
* - 0x1000
|
|
- S_IFIFO (FIFO)
|
|
* - 0x2000
|
|
- S_IFCHR (Character device)
|
|
* - 0x4000
|
|
- S_IFDIR (Directory)
|
|
* - 0x6000
|
|
- S_IFBLK (Block device)
|
|
* - 0x8000
|
|
- S_IFREG (Regular file)
|
|
* - 0xA000
|
|
- S_IFLNK (Symbolic link)
|
|
* - 0xC000
|
|
- S_IFSOCK (Socket)
|
|
|
|
.. _i_flags:
|
|
|
|
The ``i_flags`` field is a combination of these values:
|
|
|
|
.. list-table::
|
|
:widths: 16 64
|
|
:header-rows: 1
|
|
|
|
* - Value
|
|
- Description
|
|
* - 0x1
|
|
- This file requires secure deletion (EXT4_SECRM_FL). (not implemented)
|
|
* - 0x2
|
|
- This file should be preserved, should undeletion be desired
|
|
(EXT4_UNRM_FL). (not implemented)
|
|
* - 0x4
|
|
- File is compressed (EXT4_COMPR_FL). (not really implemented)
|
|
* - 0x8
|
|
- All writes to the file must be synchronous (EXT4_SYNC_FL).
|
|
* - 0x10
|
|
- File is immutable (EXT4_IMMUTABLE_FL).
|
|
* - 0x20
|
|
- File can only be appended (EXT4_APPEND_FL).
|
|
* - 0x40
|
|
- The dump(1) utility should not dump this file (EXT4_NODUMP_FL).
|
|
* - 0x80
|
|
- Do not update access time (EXT4_NOATIME_FL).
|
|
* - 0x100
|
|
- Dirty compressed file (EXT4_DIRTY_FL). (not used)
|
|
* - 0x200
|
|
- File has one or more compressed clusters (EXT4_COMPRBLK_FL). (not used)
|
|
* - 0x400
|
|
- Do not compress file (EXT4_NOCOMPR_FL). (not used)
|
|
* - 0x800
|
|
- Encrypted inode (EXT4_ENCRYPT_FL). This bit value previously was
|
|
EXT4_ECOMPR_FL (compression error), which was never used.
|
|
* - 0x1000
|
|
- Directory has hashed indexes (EXT4_INDEX_FL).
|
|
* - 0x2000
|
|
- AFS magic directory (EXT4_IMAGIC_FL).
|
|
* - 0x4000
|
|
- File data must always be written through the journal
|
|
(EXT4_JOURNAL_DATA_FL).
|
|
* - 0x8000
|
|
- File tail should not be merged (EXT4_NOTAIL_FL). (not used by ext4)
|
|
* - 0x10000
|
|
- All directory entry data should be written synchronously (see
|
|
``dirsync``) (EXT4_DIRSYNC_FL).
|
|
* - 0x20000
|
|
- Top of directory hierarchy (EXT4_TOPDIR_FL).
|
|
* - 0x40000
|
|
- This is a huge file (EXT4_HUGE_FILE_FL).
|
|
* - 0x80000
|
|
- Inode uses extents (EXT4_EXTENTS_FL).
|
|
* - 0x100000
|
|
- Verity protected file (EXT4_VERITY_FL).
|
|
* - 0x200000
|
|
- Inode stores a large extended attribute value in its data blocks
|
|
(EXT4_EA_INODE_FL).
|
|
* - 0x400000
|
|
- This file has blocks allocated past EOF (EXT4_EOFBLOCKS_FL).
|
|
(deprecated)
|
|
* - 0x01000000
|
|
- Inode is a snapshot (``EXT4_SNAPFILE_FL``). (not in mainline)
|
|
* - 0x04000000
|
|
- Snapshot is being deleted (``EXT4_SNAPFILE_DELETED_FL``). (not in
|
|
mainline)
|
|
* - 0x08000000
|
|
- Snapshot shrink has completed (``EXT4_SNAPFILE_SHRUNK_FL``). (not in
|
|
mainline)
|
|
* - 0x10000000
|
|
- Inode has inline data (EXT4_INLINE_DATA_FL).
|
|
* - 0x20000000
|
|
- Create children with the same project ID (EXT4_PROJINHERIT_FL).
|
|
* - 0x80000000
|
|
- Reserved for ext4 library (EXT4_RESERVED_FL).
|
|
* -
|
|
- Aggregate flags:
|
|
* - 0x705BDFFF
|
|
- User-visible flags.
|
|
* - 0x604BC0FF
|
|
- User-modifiable flags. Note that while EXT4_JOURNAL_DATA_FL and
|
|
EXT4_EXTENTS_FL can be set with setattr, they are not in the kernel's
|
|
EXT4_FL_USER_MODIFIABLE mask, since it needs to handle the setting of
|
|
these flags in a special manner and they are masked out of the set of
|
|
flags that are saved directly to i_flags.
|
|
|
|
.. _i_osd1:
|
|
|
|
The ``osd1`` field has multiple meanings depending on the creator:
|
|
|
|
Linux:
|
|
|
|
.. list-table::
|
|
:widths: 8 8 24 40
|
|
:header-rows: 1
|
|
|
|
* - Offset
|
|
- Size
|
|
- Name
|
|
- Description
|
|
* - 0x0
|
|
- __le32
|
|
- l_i_version
|
|
- Inode version. However, if the EA_INODE inode flag is set, this inode
|
|
stores an extended attribute value and this field contains the upper 32
|
|
bits of the attribute value's reference count.
|
|
|
|
Hurd:
|
|
|
|
.. list-table::
|
|
:widths: 8 8 24 40
|
|
:header-rows: 1
|
|
|
|
* - Offset
|
|
- Size
|
|
- Name
|
|
- Description
|
|
* - 0x0
|
|
- __le32
|
|
- h_i_translator
|
|
- ??
|
|
|
|
Masix:
|
|
|
|
.. list-table::
|
|
:widths: 8 8 24 40
|
|
:header-rows: 1
|
|
|
|
* - Offset
|
|
- Size
|
|
- Name
|
|
- Description
|
|
* - 0x0
|
|
- __le32
|
|
- m_i_reserved
|
|
- ??
|
|
|
|
.. _i_osd2:
|
|
|
|
The ``osd2`` field has multiple meanings depending on the filesystem creator:
|
|
|
|
Linux:
|
|
|
|
.. list-table::
|
|
:widths: 8 8 24 40
|
|
:header-rows: 1
|
|
|
|
* - Offset
|
|
- Size
|
|
- Name
|
|
- Description
|
|
* - 0x0
|
|
- __le16
|
|
- l_i_blocks_high
|
|
- Upper 16-bits of the block count. Please see the note attached to
|
|
i_blocks_lo.
|
|
* - 0x2
|
|
- __le16
|
|
- l_i_file_acl_high
|
|
- Upper 16-bits of the extended attribute block (historically, the file
|
|
ACL location). See the Extended Attributes section below.
|
|
* - 0x4
|
|
- __le16
|
|
- l_i_uid_high
|
|
- Upper 16-bits of the Owner UID.
|
|
* - 0x6
|
|
- __le16
|
|
- l_i_gid_high
|
|
- Upper 16-bits of the GID.
|
|
* - 0x8
|
|
- __le16
|
|
- l_i_checksum_lo
|
|
- Lower 16-bits of the inode checksum.
|
|
* - 0xA
|
|
- __le16
|
|
- l_i_reserved
|
|
- Unused.
|
|
|
|
Hurd:
|
|
|
|
.. list-table::
|
|
:widths: 8 8 24 40
|
|
:header-rows: 1
|
|
|
|
* - Offset
|
|
- Size
|
|
- Name
|
|
- Description
|
|
* - 0x0
|
|
- __le16
|
|
- h_i_reserved1
|
|
- ??
|
|
* - 0x2
|
|
- __u16
|
|
- h_i_mode_high
|
|
- Upper 16-bits of the file mode.
|
|
* - 0x4
|
|
- __le16
|
|
- h_i_uid_high
|
|
- Upper 16-bits of the Owner UID.
|
|
* - 0x6
|
|
- __le16
|
|
- h_i_gid_high
|
|
- Upper 16-bits of the GID.
|
|
* - 0x8
|
|
- __u32
|
|
- h_i_author
|
|
- Author code?
|
|
|
|
Masix:
|
|
|
|
.. list-table::
|
|
:widths: 8 8 24 40
|
|
:header-rows: 1
|
|
|
|
* - Offset
|
|
- Size
|
|
- Name
|
|
- Description
|
|
* - 0x0
|
|
- __le16
|
|
- h_i_reserved1
|
|
- ??
|
|
* - 0x2
|
|
- __u16
|
|
- m_i_file_acl_high
|
|
- Upper 16-bits of the extended attribute block (historically, the file
|
|
ACL location).
|
|
* - 0x4
|
|
- __u32
|
|
- m_i_reserved2[2]
|
|
- ??
|
|
|
|
Inode Size
|
|
~~~~~~~~~~
|
|
|
|
In ext2 and ext3, the inode structure size was fixed at 128 bytes
|
|
(``EXT2_GOOD_OLD_INODE_SIZE``) and each inode had a disk record size of
|
|
128 bytes. Starting with ext4, it is possible to allocate a larger
|
|
on-disk inode at format time for all inodes in the filesystem to provide
|
|
space beyond the end of the original ext2 inode. The on-disk inode
|
|
record size is recorded in the superblock as ``s_inode_size``. The
|
|
number of bytes actually used by struct ext4_inode beyond the original
|
|
128-byte ext2 inode is recorded in the ``i_extra_isize`` field for each
|
|
inode, which allows struct ext4_inode to grow for a new kernel without
|
|
having to upgrade all of the on-disk inodes. Access to fields beyond
|
|
EXT2_GOOD_OLD_INODE_SIZE should be verified to be within
|
|
``i_extra_isize``. By default, ext4 inode records are 256 bytes, and (as
|
|
of August 2019) the inode structure is 160 bytes
|
|
(``i_extra_isize = 32``). The extra space between the end of the inode
|
|
structure and the end of the inode record can be used to store extended
|
|
attributes. Each inode record can be as large as the filesystem block
|
|
size, though this is not terribly efficient.
|
|
|
|
Finding an Inode
|
|
~~~~~~~~~~~~~~~~
|
|
|
|
Each block group contains ``sb->s_inodes_per_group`` inodes. Because
|
|
inode 0 is defined not to exist, this formula can be used to find the
|
|
block group that an inode lives in:
|
|
``bg = (inode_num - 1) / sb->s_inodes_per_group``. The particular inode
|
|
can be found within the block group's inode table at
|
|
``index = (inode_num - 1) % sb->s_inodes_per_group``. To get the byte
|
|
address within the inode table, use
|
|
``offset = index * sb->s_inode_size``.
|
|
|
|
Inode Timestamps
|
|
~~~~~~~~~~~~~~~~
|
|
|
|
Four timestamps are recorded in the lower 128 bytes of the inode
|
|
structure -- inode change time (ctime), access time (atime), data
|
|
modification time (mtime), and deletion time (dtime). The four fields
|
|
are 32-bit signed integers that represent seconds since the Unix epoch
|
|
(1970-01-01 00:00:00 GMT), which means that the fields will overflow in
|
|
January 2038. If the filesystem does not have orphan_file feature, inodes
|
|
that are not linked from any directory but are still open (orphan inodes) have
|
|
the dtime field overloaded for use with the orphan list. The superblock field
|
|
``s_last_orphan`` points to the first inode in the orphan list; dtime is then
|
|
the number of the next orphaned inode, or zero if there are no more orphans.
|
|
|
|
If the inode structure size ``sb->s_inode_size`` is larger than 128
|
|
bytes and the ``i_inode_extra`` field is large enough to encompass the
|
|
respective ``i_[cma]time_extra`` field, the ctime, atime, and mtime
|
|
inode fields are widened to 64 bits. Within this “extra” 32-bit field,
|
|
the lower two bits are used to extend the 32-bit seconds field to be 34
|
|
bit wide; the upper 30 bits are used to provide nanosecond timestamp
|
|
accuracy. Therefore, timestamps should not overflow until May 2446.
|
|
dtime was not widened. There is also a fifth timestamp to record inode
|
|
creation time (crtime); this field is 64-bits wide and decoded in the
|
|
same manner as 64-bit [cma]time. Neither crtime nor dtime are accessible
|
|
through the regular stat() interface, though debugfs will report them.
|
|
|
|
We use the 32-bit signed time value plus (2^32 * (extra epoch bits)).
|
|
In other words:
|
|
|
|
.. list-table::
|
|
:widths: 20 20 20 20 20
|
|
:header-rows: 1
|
|
|
|
* - Extra epoch bits
|
|
- MSB of 32-bit time
|
|
- Adjustment for signed 32-bit to 64-bit tv_sec
|
|
- Decoded 64-bit tv_sec
|
|
- valid time range
|
|
* - 0 0
|
|
- 1
|
|
- 0
|
|
- ``-0x80000000 - -0x00000001``
|
|
- 1901-12-13 to 1969-12-31
|
|
* - 0 0
|
|
- 0
|
|
- 0
|
|
- ``0x000000000 - 0x07fffffff``
|
|
- 1970-01-01 to 2038-01-19
|
|
* - 0 1
|
|
- 1
|
|
- 0x100000000
|
|
- ``0x080000000 - 0x0ffffffff``
|
|
- 2038-01-19 to 2106-02-07
|
|
* - 0 1
|
|
- 0
|
|
- 0x100000000
|
|
- ``0x100000000 - 0x17fffffff``
|
|
- 2106-02-07 to 2174-02-25
|
|
* - 1 0
|
|
- 1
|
|
- 0x200000000
|
|
- ``0x180000000 - 0x1ffffffff``
|
|
- 2174-02-25 to 2242-03-16
|
|
* - 1 0
|
|
- 0
|
|
- 0x200000000
|
|
- ``0x200000000 - 0x27fffffff``
|
|
- 2242-03-16 to 2310-04-04
|
|
* - 1 1
|
|
- 1
|
|
- 0x300000000
|
|
- ``0x280000000 - 0x2ffffffff``
|
|
- 2310-04-04 to 2378-04-22
|
|
* - 1 1
|
|
- 0
|
|
- 0x300000000
|
|
- ``0x300000000 - 0x37fffffff``
|
|
- 2378-04-22 to 2446-05-10
|
|
|
|
This is a somewhat odd encoding since there are effectively seven times
|
|
as many positive values as negative values. There have also been
|
|
long-standing bugs decoding and encoding dates beyond 2038, which don't
|
|
seem to be fixed as of kernel 3.12 and e2fsprogs 1.42.8. 64-bit kernels
|
|
incorrectly use the extra epoch bits 1,1 for dates between 1901 and
|
|
1970. At some point the kernel will be fixed and e2fsck will fix this
|
|
situation, assuming that it is run before 2310.
|