mirror of
https://github.com/systemd/systemd.git
synced 2025-01-10 05:18:17 +03:00
docs: document the new journal file format additions
parent
bbcd38e41e
commit
70cd1e561c
@ -59,9 +59,9 @@ in particular realize that they may include binary non-text data (though
|
|||||||
usually don't), and the same field might have multiple values assigned within
|
usually don't), and the same field might have multiple values assigned within
|
||||||
the same entry.
|
the same entry.
|
||||||
|
|
||||||
This document describes the current format of systemd 195. The documented
|
This document describes the current format of systemd 246. The documented
|
||||||
format is compatible with the format used in the first versions of the journal,
|
format is compatible with the format used in the first versions of the journal,
|
||||||
but received various compatible additions since.
|
but received various compatible and incompatible additions since.
|
||||||
|
|
||||||
If you are wondering why the journal file format has been created in the first
|
If you are wondering why the journal file format has been created in the first
|
||||||
place instead of adopting an existing database implementation, please have a
|
place instead of adopting an existing database implementation, please have a
|
||||||
@ -73,7 +73,7 @@ thread](https://lists.freedesktop.org/archives/systemd-devel/2012-October/007054
|
|||||||
|
|
||||||
* All offsets, sizes, time values, hashes (and most other numeric values) are 64bit unsigned integers in LE format.
|
* All offsets, sizes, time values, hashes (and most other numeric values) are 64bit unsigned integers in LE format.
|
||||||
* Offsets are always relative to the beginning of the file.
|
* Offsets are always relative to the beginning of the file.
|
||||||
* The 64bit hash function used is [Jenkins lookup3](https://en.wikipedia.org/wiki/Jenkins_hash_function), more specifically jenkins_hashlittle2() with the first 32bit integer it returns as higher 32bit part of the 64bit value, and the second one uses as lower 32bit part.
|
* The 64bit hash function siphash24 is used for newer journal files. For older files [Jenkins lookup3](https://en.wikipedia.org/wiki/Jenkins_hash_function) is used, more specifically jenkins_hashlittle2(), with the first 32bit integer it returns used as the higher 32bit part of the 64bit value, and the second one used as the lower 32bit part.
|
||||||
* All structures are aligned to 64bit boundaries and padded to multiples of 64bit
|
* All structures are aligned to 64bit boundaries and padded to multiples of 64bit
|
||||||
* The format is designed to be read and written via memory mapping using multiple mapped windows.
|
* The format is designed to be read and written via memory mapping using multiple mapped windows.
|
||||||
* All time values are stored in usec since the respective epoch.
|
* All time values are stored in usec since the respective epoch.
|
||||||
@ -174,6 +174,9 @@ _packed_ struct Header {
|
|||||||
/* Added in 189 */
|
/* Added in 189 */
|
||||||
le64_t n_tags;
|
le64_t n_tags;
|
||||||
le64_t n_entry_arrays;
|
le64_t n_entry_arrays;
|
||||||
|
/* Added in 246 */
|
||||||
|
le64_t data_hash_chain_depth;
|
||||||
|
le64_t field_hash_chain_depth;
|
||||||
};
|
};
|
||||||
```
|
```
|
||||||
|
|
||||||
@ -218,6 +221,16 @@ entry has been written yet.
|
|||||||
**tail_entry_monotonic** is the monotonic timestamp of the last entry in the
|
**tail_entry_monotonic** is the monotonic timestamp of the last entry in the
|
||||||
file, referring to monotonic time of the boot identified by **boot_id**.
|
file, referring to monotonic time of the boot identified by **boot_id**.
|
||||||
|
|
||||||
|
**data_hash_chain_depth** is a counter of the deepest chain in the data hash
|
||||||
|
table, minus one. This is updated whenever a chain is found that is longer than
|
||||||
|
the previous deepest chain found. Note that the counter is updated during hash
|
||||||
|
table lookups, as the chains are traversed. This counter is used to determine
|
||||||
|
when it is a good time to rotate the journal file, because hash collisions
|
||||||
|
became too frequent.
|
||||||
|
|
||||||
|
Similarly, **field_hash_chain_depth** is a counter of the deepest chain in the
|
||||||
|
field hash table, minus one.
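
The counter update described above can be sketched as follows. This is only an illustration under assumptions: `struct node` and `chain_lookup()` are hypothetical in-memory stand-ins (a real file links objects by file offset, not by pointer), and `chain_depth` plays the role of **data_hash_chain_depth**.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory stand-in for one element of a hash chain. */
struct node {
        uint64_t hash;
        struct node *next;
};

/* Walks one chain looking for 'hash'. While traversing, raises
 * *chain_depth whenever this walk gets deeper than any walk before.
 * The value stored is the number of elements visited, minus one. */
static struct node *chain_lookup(struct node *head, uint64_t hash, uint64_t *chain_depth) {
        uint64_t depth = 0; /* links followed so far */

        for (struct node *n = head; n; n = n->next, depth++) {
                if (depth > *chain_depth)
                        *chain_depth = depth;
                if (n->hash == hash)
                        return n;
        }

        return NULL;
}
```

Because the counter only grows while chains are actually traversed, it reflects the deepest chain any lookup has seen so far, not necessarily the deepest chain in the table.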
|
||||||
|
|
||||||
|
|
||||||
## Extensibility
|
## Extensibility
|
||||||
|
|
||||||
@ -238,20 +251,30 @@ unconditionally exist in all revisions of the file format, all fields starting
|
|||||||
with "n_data" need to be explicitly checked for via a size check, since they
|
with "n_data" need to be explicitly checked for via a size check, since they
|
||||||
were additions after the initial release.
|
were additions after the initial release.
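
A size check of this kind might look like the sketch below. The struct is an abbreviated, hypothetical stand-in and NOT the real header layout, the macro name `HEADER_CONTAINS` is made up for this example, and the values are assumed to be already converted from little endian to host byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Abbreviated, hypothetical header used only for illustration. */
struct mini_header {
        uint64_t header_size;           /* header size as recorded in the file */
        uint64_t n_objects;             /* stand-in for an original field */
        uint64_t n_data;                /* stand-in for a later addition */
        uint64_t data_hash_chain_depth; /* stand-in for an even later addition */
};

/* True if the recorded header size is large enough to cover 'field'
 * completely, i.e. the writer knew about that field. */
#define HEADER_CONTAINS(h, field) \
        ((h)->header_size >= offsetof(struct mini_header, field) + sizeof((h)->field))
```

A reader would guard every access to a later addition with such a check and treat the field as absent when the check fails.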
|
||||||
|
|
||||||
Currently only two extensions flagged in the flags fields are known:
|
Currently only five extensions flagged in the flags fields are known:
|
||||||
|
|
||||||
```c
|
```c
|
||||||
enum {
|
enum {
|
||||||
HEADER_INCOMPATIBLE_COMPRESSED = 1
|
HEADER_INCOMPATIBLE_COMPRESSED_XZ = 1 << 0,
|
||||||
|
HEADER_INCOMPATIBLE_COMPRESSED_LZ4 = 1 << 1,
|
||||||
|
HEADER_INCOMPATIBLE_KEYED_HASH = 1 << 2,
|
||||||
|
HEADER_INCOMPATIBLE_COMPRESSED_ZSTD = 1 << 3,
|
||||||
};
|
};
|
||||||
|
|
||||||
enum {
|
enum {
|
||||||
HEADER_COMPATIBLE_SEALED = 1
|
HEADER_COMPATIBLE_SEALED = 1 << 0,
|
||||||
};
|
};
|
||||||
```
|
```
|
||||||
|
|
||||||
HEADER_INCOMPATIBLE_COMPRESSED indicates that the file includes DATA objects
|
HEADER_INCOMPATIBLE_COMPRESSED_XZ indicates that the file includes DATA objects
|
||||||
that are compressed using XZ.
|
that are compressed using XZ. Similarly, HEADER_INCOMPATIBLE_COMPRESSED_LZ4
|
||||||
|
indicates that the file includes DATA objects that are compressed with the LZ4
|
||||||
|
algorithm. And HEADER_INCOMPATIBLE_COMPRESSED_ZSTD indicates that there are
|
||||||
|
objects compressed with ZSTD.
|
||||||
|
|
||||||
|
HEADER_INCOMPATIBLE_KEYED_HASH indicates that instead of the unkeyed Jenkins
|
||||||
|
hash function the keyed siphash24 hash function is used for the two hash
|
||||||
|
tables, see below.
|
||||||
|
|
||||||
HEADER_COMPATIBLE_SEALED indicates that the file includes TAG objects required
|
HEADER_COMPATIBLE_SEALED indicates that the file includes TAG objects required
|
||||||
for Forward Secure Sealing.
|
for Forward Secure Sealing.
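
As a rough sketch of how a reader might act on the incompatible flags: the helper names and the `SUPPORTED_INCOMPATIBLE` mask are assumptions made for this example, and the flags value is assumed to be 32 bits wide and already converted to host byte order.

```c
#include <stdbool.h>
#include <stdint.h>

enum {
        HEADER_INCOMPATIBLE_COMPRESSED_XZ   = 1 << 0,
        HEADER_INCOMPATIBLE_COMPRESSED_LZ4  = 1 << 1,
        HEADER_INCOMPATIBLE_KEYED_HASH      = 1 << 2,
        HEADER_INCOMPATIBLE_COMPRESSED_ZSTD = 1 << 3,
};

/* Hypothetical: the set of incompatible extensions this reader implements. */
#define SUPPORTED_INCOMPATIBLE                          \
        (HEADER_INCOMPATIBLE_COMPRESSED_XZ |            \
         HEADER_INCOMPATIBLE_COMPRESSED_LZ4 |           \
         HEADER_INCOMPATIBLE_KEYED_HASH |               \
         HEADER_INCOMPATIBLE_COMPRESSED_ZSTD)

/* A file is only readable if it sets no incompatible flag we do not know. */
static bool incompatible_flags_supported(uint32_t flags) {
        return (flags & ~(uint32_t) SUPPORTED_INCOMPATIBLE) == 0;
}

/* Selects between siphash24 (keyed) and Jenkins lookup3 (unkeyed). */
static bool file_uses_keyed_hash(uint32_t flags) {
        return (flags & HEADER_INCOMPATIBLE_KEYED_HASH) != 0;
}
```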
|
||||||
@ -308,9 +331,9 @@ structure gracefully. (Checking what you read is a pretty good idea out of
|
|||||||
security considerations anyway.) This specifically includes checking offset
|
security considerations anyway.) This specifically includes checking offset
|
||||||
values, and that they point to valid objects, with valid sizes and of the type
|
values, and that they point to valid objects, with valid sizes and of the type
|
||||||
and hash value expected. All code must be written with the fact in mind that a
|
and hash value expected. All code must be written with the fact in mind that a
|
||||||
file with inconsistent structure file might just be inconsistent temporarily,
|
file with inconsistent structure might just be inconsistent temporarily, and
|
||||||
and might become consistent later on. Payload OTOH requires less scrutiny, as
|
might become consistent later on. Payload OTOH requires less scrutiny, as it
|
||||||
it should only be linked up (and hence visible to readers) after it was
|
should only be linked up (and hence visible to readers) after it was
|
||||||
successfully written to memory (though not necessarily to disk). On non-local
|
successfully written to memory (though not necessarily to disk). On non-local
|
||||||
file systems it is a good idea to verify the payload hashes when reading, in
|
file systems it is a good idea to verify the payload hashes when reading, in
|
||||||
order to avoid annoyances with mmap() inconsistencies.
|
order to avoid annoyances with mmap() inconsistencies.
|
||||||
@ -319,8 +342,8 @@ Clients intending to show a live view of the journal should use inotify() for
|
|||||||
this to watch for file changes. Since file writes done via mmap() do not
|
this to watch for file changes. Since file writes done via mmap() do not
|
||||||
result in inotify() events, writers shall truncate the file to its current size after
|
result in inotify() events, writers shall truncate the file to its current size after
|
||||||
writing one or more entries, which results in inotify events being
|
writing one or more entries, which results in inotify events being
|
||||||
generated. Note that this is not used as transaction scheme (it doesn't protect
|
generated. Note that this is not used as a transaction scheme (it doesn't
|
||||||
anything), but merely for triggering wakeups.
|
protect anything), but merely for triggering wakeups.
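
The wakeup trick described above could be implemented roughly like this. It is a minimal sketch using plain POSIX calls; the function name `journal_post_change` is hypothetical, and error handling is reduced to returning -1.

```c
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Truncates the file to the size it already has. The contents stay
 * untouched; the call is made only so that watchers get woken up. */
static int journal_post_change(int fd) {
        struct stat st;

        if (fstat(fd, &st) < 0)
                return -1;

        return ftruncate(fd, st.st_size);
}
```

A writer would call this once after appending a batch of entries through its memory mapping.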
|
||||||
|
|
||||||
Note that inotify will not work on network file systems if reader and writer
|
Note that inotify will not work on network file systems if reader and writer
|
||||||
reside on different hosts. Readers which detect they are run on journal files
|
reside on different hosts. Readers which detect they are run on journal files
|
||||||
@ -334,7 +357,9 @@ All objects carry a common header:
|
|||||||
|
|
||||||
```c
|
```c
|
||||||
enum {
|
enum {
|
||||||
OBJECT_COMPRESSED = 1
|
OBJECT_COMPRESSED_XZ = 1 << 0,
|
||||||
|
OBJECT_COMPRESSED_LZ4 = 1 << 1,
|
||||||
|
OBJECT_COMPRESSED_ZSTD = 1 << 2,
|
||||||
};
|
};
|
||||||
|
|
||||||
_packed_ struct ObjectHeader {
|
_packed_ struct ObjectHeader {
|
||||||
@ -346,10 +371,13 @@ _packed_ struct ObjectHeader {
|
|||||||
};
|
};
|
||||||
|
|
||||||
The **type** field is one of the object types listed above. The **flags** field
|
The **type** field is one of the object types listed above. The **flags** field
|
||||||
currently knows one flag: OBJECT_COMPRESSED. It is only valid for DATA objects
|
currently knows three flags: OBJECT_COMPRESSED_XZ, OBJECT_COMPRESSED_LZ4 and
|
||||||
and indicates that the data payload is compressed with XZ. If OBJECT_COMPRESSED
|
OBJECT_COMPRESSED_ZSTD. They are only valid for DATA objects and indicate that
|
||||||
is set for an object HEADER_INCOMPATIBLE_COMPRESSED must be set for the file as
|
the data payload is compressed with XZ/LZ4/ZSTD. If one of the
|
||||||
well. The **size** field encodes the size of the object including all its
|
OBJECT_COMPRESSED_* flags is set for an object then the matching
|
||||||
|
HEADER_INCOMPATIBLE_COMPRESSED_XZ/HEADER_INCOMPATIBLE_COMPRESSED_LZ4/HEADER_INCOMPATIBLE_COMPRESSED_ZSTD
|
||||||
|
flag must be set for the file as well. At most one of these three bits may be
|
||||||
|
set. The **size** field encodes the size of the object including all its
|
||||||
headers and payload.
|
headers and payload.
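
The two rules above (at most one compression bit per object, and a matching file-level flag) can be checked as in the following sketch. The function name is hypothetical, and the widths chosen for the two flag values (8 bits for the object, 32 bits for the file header) are assumptions. Note that the bit positions differ between the two enums: ZSTD is bit 2 on the object but bit 3 in the file header.

```c
#include <stdbool.h>
#include <stdint.h>

enum {
        OBJECT_COMPRESSED_XZ   = 1 << 0,
        OBJECT_COMPRESSED_LZ4  = 1 << 1,
        OBJECT_COMPRESSED_ZSTD = 1 << 2,
};

enum {
        HEADER_INCOMPATIBLE_COMPRESSED_XZ   = 1 << 0,
        HEADER_INCOMPATIBLE_COMPRESSED_LZ4  = 1 << 1,
        HEADER_INCOMPATIBLE_KEYED_HASH      = 1 << 2,
        HEADER_INCOMPATIBLE_COMPRESSED_ZSTD = 1 << 3,
};

static bool object_compression_valid(uint8_t object_flags, uint32_t incompatible_flags) {
        unsigned c = object_flags &
                (OBJECT_COMPRESSED_XZ | OBJECT_COMPRESSED_LZ4 | OBJECT_COMPRESSED_ZSTD);

        /* More than one compression bit set? */
        if (c & (c - 1))
                return false;

        switch (c) {
        case 0:
                return true; /* uncompressed, nothing to match */
        case OBJECT_COMPRESSED_XZ:
                return (incompatible_flags & HEADER_INCOMPATIBLE_COMPRESSED_XZ) != 0;
        case OBJECT_COMPRESSED_LZ4:
                return (incompatible_flags & HEADER_INCOMPATIBLE_COMPRESSED_LZ4) != 0;
        case OBJECT_COMPRESSED_ZSTD:
                return (incompatible_flags & HEADER_INCOMPATIBLE_COMPRESSED_ZSTD) != 0;
        default:
                return false;
        }
}
```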
|
||||||
|
|
||||||
|
|
||||||
@ -371,7 +399,12 @@ _packed_ struct DataObject {
|
|||||||
Data objects carry actual field data in the **payload[]** array, including a
|
Data objects carry actual field data in the **payload[]** array, including a
|
||||||
field name, a '=' and the field data. Example:
|
field name, a '=' and the field data. Example:
|
||||||
`_SYSTEMD_UNIT=foobar.service`. The **hash** field is a hash value of the
|
`_SYSTEMD_UNIT=foobar.service`. The **hash** field is a hash value of the
|
||||||
payload.
|
payload. If the `HEADER_INCOMPATIBLE_KEYED_HASH` flag is set in the file header
|
||||||
|
this is the siphash24 hash value of the payload, keyed by the file ID as stored
|
||||||
|
in the `.file_id` field of the file header. If the flag is not set it is the
|
||||||
|
non-keyed Jenkins hash of the payload instead. The keyed hash is preferred as
|
||||||
|
it makes the format more robust against attackers that want to trigger hash
|
||||||
|
collisions in the hash table.
|
||||||
|
|
||||||
**next_hash_offset** is used to link up DATA objects in the DATA_HASH_TABLE if
|
**next_hash_offset** is used to link up DATA objects in the DATA_HASH_TABLE if
|
||||||
a hash collision happens (in a singly linked list, with an offset of 0
|
a hash collision happens (in a singly linked list, with an offset of 0
|
||||||
@ -388,8 +421,9 @@ number of ENTRY objects that reference this object, i.e. the sum of all
|
|||||||
ENTRY_ARRAYS chained up from this object, plus 1.
|
ENTRY_ARRAYS chained up from this object, plus 1.
|
||||||
|
|
||||||
The **payload[]** field contains the field name and data unencoded, unless
|
The **payload[]** field contains the field name and data unencoded, unless
|
||||||
OBJECT_COMPRESSED is set in the `ObjectHeader`, in which case the payload is
|
OBJECT_COMPRESSED_XZ/OBJECT_COMPRESSED_LZ4/OBJECT_COMPRESSED_ZSTD is set in the
|
||||||
LZMA compressed.
|
`ObjectHeader`, in which case the payload is compressed with the indicated
|
||||||
|
compression algorithm.
|
||||||
|
|
||||||
|
|
||||||
## Field Objects
|
## Field Objects
|
||||||
@ -448,10 +482,17 @@ identified by **boot_id**.
|
|||||||
The **xor_hash** field contains a binary XOR of the hashes of the payload of
|
The **xor_hash** field contains a binary XOR of the hashes of the payload of
|
||||||
all DATA objects referenced by this ENTRY. This value is usable to check the
|
all DATA objects referenced by this ENTRY. This value is usable to check the
|
||||||
contents of the entry, being independent of the order of the DATA objects in
|
contents of the entry, being independent of the order of the DATA objects in
|
||||||
the array.
|
the array. Note that even for files that have the
|
||||||
|
`HEADER_INCOMPATIBLE_KEYED_HASH` flag set (and thus use siphash24 as the
|
||||||
|
hash function elsewhere) the hash function used for this field, as a singular
|
||||||
|
exception, is the Jenkins lookup3 hash function. The XOR hash value is used to
|
||||||
|
quickly compare the contents of two entries, and to define a well-defined order
|
||||||
|
between two entries that otherwise have the same sequence numbers and
|
||||||
|
timestamps.
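
The order independence follows directly from XOR being commutative and associative, as this small sketch shows. The function name is hypothetical, and the input array is assumed to already hold the Jenkins lookup3 hashes of the payloads (not the keyed hashes).

```c
#include <stddef.h>
#include <stdint.h>

/* XORs together the per-payload hashes of one entry. Any permutation of
 * the same hashes yields the same result. */
static uint64_t entry_xor_hash(const uint64_t *hashes, size_t n) {
        uint64_t x = 0;

        for (size_t i = 0; i < n; i++)
                x ^= hashes[i];

        return x;
}
```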
|
||||||
|
|
||||||
The **items[]** array contains references to all DATA objects of this entry,
|
The **items[]** array contains references to all DATA objects of this entry,
|
||||||
plus their respective hashes.
|
plus their respective hashes (which are calculated the same way as in the DATA
|
||||||
|
objects, i.e. keyed by the file ID).
|
||||||
|
|
||||||
In the file ENTRY objects are written ordered monotonically by sequence
|
In the file ENTRY objects are written ordered monotonically by sequence
|
||||||
number. For continuous parts of the file written during the same boot
|
number. For continuous parts of the file written during the same boot
|
||||||
@ -494,8 +535,8 @@ and create a new one.
|
|||||||
|
|
||||||
The DATA_HASH_TABLE should be sized taking into account the maximum size the
|
The DATA_HASH_TABLE should be sized taking into account the maximum size the
|
||||||
file is expected to grow, as configured by the administrator or disk space
|
file is expected to grow, as configured by the administrator or disk space
|
||||||
considerations. The FIELD_HASH_TABLE should be sized to a fixed size, as the
|
considerations. The FIELD_HASH_TABLE should be sized to a fixed size; the
|
||||||
number of fields should be pretty static it depends only on developers'
|
number of fields should be pretty static as it depends only on developers'
|
||||||
creativity rather than runtime parameters.
|
creativity rather than runtime parameters.
|
||||||
|
|
||||||
|
|
||||||
|