rpm-ostree/design/jigdo.md
Colin Walters 694b798c73 Introduce experimental "rpm-ostree jigdo"
Tracking issue: https://github.com/projectatomic/rpm-ostree/issues/1081

To briefly recap: Let's experiment with doing ostree-in-RPM, basically the
"compose" process injects additional data (SELinux labels for example) in an
"ostree image" RPM, like `fedora-atomic-host-27.8-1.x86_64.rpm`. That "ostree
image" RPM will contain the OSTree commit+metadata, and tell us what RPMs we
need need to download. For updates, like `yum update` we only download changed
RPMs, plus the new "oirpm". But SELinux labeling, depsolving, etc. are still
done server side, and we still have a reliable OSTree commit checksum.

This is a lot like [Jigdo](http://atterer.org/jigdo/)

Here we fully demonstrate the concept working end-to-end; we use the
"traditional" `compose tree` to commit a bunch of RPMs to an OSTree repo, which
has a checksum, version etc. Then the new `ex commit2jigdo` generates the
"oirpm". This is the "server side" operation. Next simulating the client side,
`jigdo2commit` takes the OIRPM and uses it and downloads the "jigdo set" RPMs,
fully regenerating *bit for bit* the final OSTree commit.

If you want to play with this, I'd take a look at the `test-jigdo.sh`; from
there you can find other useful bits like the example `fedora-atomic-host.spec`
file (though the canonical copy of this will likely land in the
[fedora-atomic](http://pagure.io/fedora-atomic) manifest git repo.

Closes: #1103
Approved by: jlebon
2017-12-04 14:24:53 +00:00

98 lines
4.9 KiB
Markdown

Introducing rpm-ostree jigdo
--------
In the rpm-ostree project, we're blending an image system (libostree)
with a package system (libdnf). The goal is to gain the
advantages of both. However, the dual nature also brings overhead;
this proposal aims to reduce some of that by adding a new "jigdo"
model to rpm-ostree that makes more operations use the libdnf side.
To do this, we're reviving an old idea: The [http://atterer.org/jigdo/](Jigdo)
approach to reassembling large "images" by downloading component packages. (We're
not using the code, just the idea).
In this approach, we're still maintaining the "image" model of libostree. When
one deploys an OSTree commit, it will reliably be bit-for-bit identical. It will
have a checksum and a version number. There will be *no* dependency resolution
on the client by default, etc.
The change is that we always use libdnf to download RPM packages as they exist
today, storing any additional data inside a new "ostree-image" RPM. In this
proposal, rather than using ostree branches, the system tracks an "ostree-image"
RPM that behaves a bit like a "metapackage".
Why?
----
The "dual" nature of the system appears in many ways; users and administrators
effectively need to understand and manage both systems.
An example is when one needs to mirror content. While libostree does support
mirroring, and projects like Pulp make use of it, support is not as widespread
as mirroring for RPM. And mirroring is absolutely critical for many
organizations that don't want to depend on Internet availability.
Related to this is the mapping of libostree "branches" and rpm-md repos. In
Fedora we offer multiple branches for Atomic Host, such as
`fedora/27/x86_64/atomic-host` as well as
`fedora/27/x86_64/testing/atomic-host`, where the latter is equivalent to `yum
--enablerepo=updates-testing update`. In many ways, I believe the way we're
exposing as OSTree branches is actually nicer - it's very clear when you're on
the testing branch.
However, it's also very *different* from the yum/dnf model. Once package
layering is involved (and for a lot of small scale use cases it will be,
particularly for desktop systems), the libostree side is something that many
users and administrators have to learn *in addition* to their previous "mental model"
of how the libdnf/yum/rpm side works with `/etc/yum.repos.d` etc.
Finally, for network efficiency; on the wire, libostree has two formats, and the
intention is that most updates hit the network-efficient static delta path, but
there are various cases where this doesn't happen, such as if one is skipping a
few updates, or today when rebasing between branches. In practice, as soon as
one involves libdnf, the repodata is already large enough that it's not worth
trying to optimize fetching content over simply redownloading changed RPMs.
(Aside: people doing custom systems tend to like the network efficiency of "pure
ostree" where one doesn't pay the "repodata cost" and we will continue to
support that.)
How?
----
We've already stated that a primary design goal is to preserve the "image"
functionality by default. Further, let's assume that we have an OSTree commit,
and we want to point it at a set of RPMs to use as the jigdo source. The source
OSTree commit can have modified, added to, or removed data from the RPM set, and
we will support that. Examples of additional data are the initramfs and RPM
database.
We're hence treating the RPM set as just data blobs; again, no dependency
resolution, `%post` scripts or the like will be executed on the client. Or again
to state this more strongly, installation will still result in an OSTree commit
with checksum that is bit-for-bit identical.
A simple approach is to scan over the set of files in the RPMs, then the set
of files in the OSTree commit, and add RPMs which contain files in the OSTree
commit to our "jigdo set".
However, a major complication is SELinux labeling. It turns out that in a lot of
cases, doing SELinux labeling is expensive; there are thousands of regular
expressions involved. However, RPM packages themselves don't contain labels;
instead the labeling is stored in the `selinux-policy-targeted` package, and
further complicating things is that there are also other packages that add
labeling configuration such as `container-selinux`. In other words there's a
circular dependency: packages have labels, but labels are contained in packages.
We go to great lengths to handle this in rpm-ostree for package layering, and we
need to do the same for jigdo.
We can address this by having our OIRPM contain a mapping of (package, file
path) to a set of extended attributes (including the key `security.selinux`
one).
At this point, if we add in the new objects such as the metadata objects from
the OSTree commit and all new content objects that aren't part of a package,
we'll have our OIRPM. (There is
some [further complexity](https://pagure.io/fedora-atomic/issue/94) around
handling the initramfs and SELinux labeling that we'll omit for now).