docs/architecture-core.md: New file

This is long overdue.  Some of this came up in recent
conversation.  Let's keep up some continual background
momentum on documentation, just like CI.

Co-authored-by: Jonathan Lebon <jonathan@jlebon.com>
This commit is contained in:
Colin Walters 2021-02-23 23:03:09 +00:00 committed by OpenShift Merge Robot
parent 5a79ca9035
commit 927e02100f

186
docs/architecture-core.md Normal file
View File

@ -0,0 +1,186 @@
---
nav_order: 5
---
# Architecture: RPM packages, ostree commits
{: .no_toc }
1. TOC
{:toc}
## RPMs + config -> single OSTree commit
On the compose side, to generate the "base image" the core idea
is that we take a set of packages as input, along with other configuration
and data and generate a single OSTree commit - a versioned filesystem
tree.
The same is also true on the client side, but it starts from a "base commit".
This document will describe the "core" phases and steps in that
process that apply both build/compose side and client side.
## Philosophy: Every change is "from scratch"
For every change today, rpm-ostree generally rebuilds the target filesystem
"from scratch" - adding accurate caching where needed. The goal is to
avoid [hysteresis](https://en.wikipedia.org/wiki/Hysteresis) ([related blog](https://blog.verbum.org/2020/08/22/immutable-%E2%86%92-reprovisionable-anti-hysteresis/)).
In other words if e.g. you `rpm-ostree install foo` and then `rpm-ostree install bar`,
the new target filesystem tree will be regenerated "from scratch" and
all RPM `%post` scripts etc. will rerun.
## Overall architecture:
- For each package, download and import into OSTree commit if necessary
- Unpack the "base" filesystem tree if any via hardlinks
- Determine an installation order, and unpack each package-ostree commit
again via hardlinks
- Run all the `%post` scripts (in install order)
- Run all the `%posttrans` scripts (in install order)
- Write RPM database (if we had a "base commit", starting from that)
- If initramfs regeneration is enabled or the kernel was replaced,
remove the base initramfs and run `dracut` to generate a new one.
- Ask libostree to commit the resulting filesystem tree, optimized
by a (device, inode) -> checksum cache, so that files that weren't
changed aren't re-checksummed.
### Generating the filesystem tree
In contrast to the above, traditional package managers like RPM are usually implemented in
a flow that does:
- Unpack package A
- Run `%post` script for A
- Update the package database with metadata for A
- Unpack package B
- Run `%post` script for B
- Update the package database with metadata for B
- ...
- Run `%posttrans` scripts
etc.
In contrast, rpm-ostree maintains an OSTree commit corresponding
to each RPM package provided as input. On a client system,
you can see this in e.g. `ostree refs | grep rpmostree/pkg`
(assuming you have layered packages). On a build system,
these ostree commits will be stored in a repo at
`pkgcache-repo/` within the cache directory.
This acts as an optimized cache for regenerating the target
root filesystem. So for rpm-ostree, the phase is more like this:
- Unpack the filesystem tree for all packages
- Run all the `%post` scripts
- Run all the `%posttrans` scripts
...
rpm-ostree is effectively reimplementing large chunks of
the librpm userspace in order to make it use OSTree natively.
### Sandboxing scripts
On the build server side, it's obviously desirable to
have the "build" of an ostree commit for a target system
not affect the running host.
Similarly, on the client side, the default is to provide
"offline" updates that don't affect the running system.
As part of this, rpm-ostree currently uses the
[bubblewrap](https://github.com/containers/bubblewrap/)
tool to run each script in its own isolated container.
Today, scripts are run with real uid 0 (not in a user namespace),
but we [drop most capabilities](https://github.com/coreos/rpm-ostree/pull/1099).
Additionally, scripts can't see the real host root filesystem,
most notably they do not see the real `/var` with all of the
system data. A good example of the benefit of this is
["tests: Add a test case for a %post that does rm -rf /"](https://github.com/coreos/rpm-ostree/pull/888).
In addition to bubblewrap, rpm-ostree uses `rofiles-fuse`
from the ostree project which originally enforced the model that
a file that has multiple hardlinks is read-only, but
more recently gained `--copyup` support which acts
in a similar fashion to the in-kernel `overlayfs`.
(See also https://github.com/ostreedev/ostree/issues/2281)
### Kernel handling
ostree is entirely oriented around bootable filesystem trees;
its "source of truth" is the bootloader entries. It has opinions
about where the Linux kernel binaries are stored (the current
standard is in `/usr/lib/modules/$kver`.)
In contrast, traditional RPM is unaware of what a kernel is; it's
just another package. Most higher level package managers such
as yum gained some special casing around the kernel - because
it's not possible to restart the running kernel, traditional RPM
systems need to keep the kernel modules for the running kernel
around. For example yum/dnf have a concept of "installonlyn"
which defaults to 2 for the kernel package.
Additionally, for at least traditional Fedora derivatives with
yum/dnf, the initramfs is generated client side as part of
a kernel update.
But for rpm-ostree, the decision was made to default to a
pre-generated initramfs by default. Further, in order to implement transactional
upgrades, rpm-ostree needs to be in control of the initramfs
regeneration - it can't just be a script forked off without its
knowledge.
Further for rpm-ostree, easily replacing the kernel (as well as userspace)
is intended to be a first-class operation; you need to be able to do that
in order to debug production issues for example.
In contrast to the yum/dnf "installonly" for ostree there can be exactly one kernel per userspace
filesystem tree. To ostree, a "bootable ostree commit"
is the pair of (kernel, userspace).
rpm-ostree combines these two worlds, and goes to some
lengths to bend the libdnf stack to work this way. We reset
the "installonly" limit back to 1 to ensure we have exactly
one kernel. PR: https://github.com/coreos/rpm-ostree/pull/1228
Further, as noted above rpm-ostree takes over the handling of
invoking `dracut` - just like other scripts, it is run inside
a container with just read-only access to the system. `dracut`
generates the initramfs CPIO archive, which we then place inside
the `/usr/lib/modules/$kver` location.
If client-side initramfs regeneration is enabled, we may selectively
provide desired configuration files into this process. PR: https://github.com/coreos/rpm-ostree/pull/2170
### SELinux
Handling SELinux is very tricky, because it is a package that can affect
*every other package*. Specifically, the SELinux policy package
contains a vast set of regular expressions in `file_contexts`
to determine labeling.
For traditional librpm, this is a plugin.
A major goal of OSTree from the start has been to ensure fully correct
handling of SELinux for the base operating system. The way
rpm-ostree handles this is by:
- Recompiling the policy as a `%posttrans` equivalent
- Loading the policy from the target root, and pass that loaded policy
to libostree, which consults it to use for the label of each
committed file.
This means that on an OSTree based system, the labels for the
files in the booted deployment (e.g. in `/usr`) are always
correct and set atomically - there's no need to relabel.
#### SELinux policy storage location
Another major difference between traditional yum/dnf and
rpm-ostree based systems is the location of the SELinux
policy store database itself. rpm-ostree overrides it
to be back in `/etc`, when it was moved to `/var` in the
RPM package around the Fedora 24 timeframe. For more information
see https://bugzilla.redhat.com/show_bug.cgi?id=1290659
and the comments in `rpmostree-postprocess.cxx`.