rpm-ostree/design/jigdo.md
Colin Walters 694b798c73 Introduce experimental "rpm-ostree jigdo"
Tracking issue: https://github.com/projectatomic/rpm-ostree/issues/1081

To briefly recap: let's experiment with doing ostree-in-RPM. Basically, the
"compose" process injects additional data (SELinux labels, for example) into an
"ostree image" RPM, like `fedora-atomic-host-27.8-1.x86_64.rpm`. That "ostree
image" RPM will contain the OSTree commit+metadata, and tell us what RPMs we
need to download. For updates, like `yum update`, we only download changed
RPMs, plus the new "oirpm". But SELinux labeling, depsolving, etc. are still
done server side, and we still have a reliable OSTree commit checksum.

This is a lot like [Jigdo](http://atterer.org/jigdo/).

Here we fully demonstrate the concept working end-to-end; we use the
"traditional" `compose tree` to commit a bunch of RPMs to an OSTree repo, which
has a checksum, version etc. Then the new `ex commit2jigdo` generates the
"oirpm". This is the "server side" operation. Next, simulating the client side,
`jigdo2commit` takes the OIRPM, downloads the "jigdo set" RPMs it lists, and
fully regenerates *bit for bit* the final OSTree commit.

If you want to play with this, I'd take a look at `test-jigdo.sh`; from
there you can find other useful bits like the example `fedora-atomic-host.spec`
file (though the canonical copy of this will likely land in the
[fedora-atomic](http://pagure.io/fedora-atomic) manifest git repo).

Closes: #1103
Approved by: jlebon
2017-12-04 14:24:53 +00:00


Introducing rpm-ostree jigdo

In the rpm-ostree project, we're blending an image system (libostree) with a package system (libdnf). The goal is to gain the advantages of both. However, the dual nature also brings overhead; this proposal aims to reduce some of that by adding a new "jigdo" model to rpm-ostree that makes more operations use the libdnf side.

To do this, we're reviving an old idea: the http://atterer.org/jigdo/ approach to reassembling large "images" by downloading component packages. (We're not using the code, just the idea.)

In this approach, we're still maintaining the "image" model of libostree. When one deploys an OSTree commit, it will reliably be bit-for-bit identical. It will have a checksum and a version number. There will be no dependency resolution on the client by default, etc.

The change is that we always use libdnf to download RPM packages as they exist today, storing any additional data inside a new "ostree-image" RPM. In this proposal, rather than using ostree branches, the system tracks an "ostree-image" RPM that behaves a bit like a "metapackage".

Why?

The "dual" nature of the system appears in many ways; users and administrators effectively need to understand and manage both systems.

An example is when one needs to mirror content. While libostree does support mirroring, and projects like Pulp make use of it, support is not as widespread as mirroring for RPM. And mirroring is absolutely critical for many organizations that don't want to depend on Internet availability.

Related to this is the mapping of libostree "branches" and rpm-md repos. In Fedora we offer multiple branches for Atomic Host, such as `fedora/27/x86_64/atomic-host` as well as `fedora/27/x86_64/testing/atomic-host`, where the latter is equivalent to `yum --enablerepo=updates-testing update`. In many ways, I believe the way we're exposing these as OSTree branches is actually nicer: it's very clear when you're on the testing branch.

However, it's also very different from the yum/dnf model. Once package layering is involved (and for a lot of small scale use cases it will be, particularly for desktop systems), the libostree side is something that many users and administrators have to learn in addition to their previous "mental model" of how the libdnf/yum/rpm side works with /etc/yum.repos.d etc.

Finally, there is network efficiency. On the wire, libostree has two formats, and the intention is that most updates hit the network-efficient static delta path. But there are various cases where this doesn't happen, such as when one skips a few updates, or today when rebasing between branches. In practice, as soon as one involves libdnf, the repodata is already large enough that it's not worth trying to optimize fetching content beyond simply redownloading changed RPMs.

(Aside: people doing custom systems tend to like the network efficiency of "pure ostree" where one doesn't pay the "repodata cost" and we will continue to support that.)

How?

We've already stated that a primary design goal is to preserve the "image" functionality by default. Further, let's assume that we have an OSTree commit, and we want to point it at a set of RPMs to use as the jigdo source. The source OSTree commit can have modified, added to, or removed data from the RPM set, and we will support that. Examples of additional data are the initramfs and RPM database.

We're hence treating the RPM set as just data blobs; again, no dependency resolution, %post scripts or the like will be executed on the client. Or, to state this more strongly: installation will still result in an OSTree commit with a checksum that is bit-for-bit identical.

A simple approach is to scan over the set of files in the RPMs, then the set of files in the OSTree commit, and add RPMs which contain files in the OSTree commit to our "jigdo set".
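The scan described above can be sketched roughly as follows. This is only an illustration, not the rpm-ostree implementation: the function name `compute_jigdo_set`, the dict-based inputs, and the choice to match files by both path and content digest are all assumptions made for the example.

```python
def compute_jigdo_set(commit_files, rpm_files):
    """Return (jigdo_set, unmatched).

    commit_files: dict mapping file path -> content digest in the OSTree commit
    rpm_files:    dict mapping package name -> {file path: content digest}
    Both are hypothetical in-memory stand-ins for the real commit and RPMs.
    """
    jigdo_set = set()
    unmatched = dict(commit_files)  # files not yet attributed to any RPM
    for pkg, files in rpm_files.items():
        for path, digest in files.items():
            # An RPM joins the jigdo set if it provides at least one file
            # that appears unchanged in the commit.
            if commit_files.get(path) == digest:
                jigdo_set.add(pkg)
                unmatched.pop(path, None)
    return jigdo_set, unmatched


# Made-up example data: the initramfs is generated at compose time,
# so no RPM provides it.
commit_files = {
    "/usr/bin/bash": "aaa",
    "/usr/bin/ls": "bbb",
    "/usr/lib/modules/initramfs.img": "zzz",
}
rpm_files = {
    "bash": {"/usr/bin/bash": "aaa"},
    "coreutils": {"/usr/bin/ls": "bbb", "/usr/bin/cp": "ccc"},
    "emacs": {"/usr/bin/emacs": "ddd"},
}
jigdo_set, unmatched = compute_jigdo_set(commit_files, rpm_files)
print(sorted(jigdo_set))
print(unmatched)
```

Files left in `unmatched` are the "additional data" that cannot be recovered from any RPM and would have to travel some other way.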

However, a major complication is SELinux labeling. Doing SELinux labeling is often expensive; there are thousands of regular expressions involved. RPM packages themselves don't contain labels; instead the labeling is stored in the selinux-policy-targeted package, and, further complicating things, other packages such as container-selinux add labeling configuration as well. In other words there's a circular dependency: packages have labels, but labels are contained in packages. We go to great lengths to handle this in rpm-ostree for package layering, and we need to do the same for jigdo.

We can address this by having our OIRPM contain a mapping of (package, file path) to a set of extended attributes (including the key security.selinux one).
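As a rough picture of that mapping, here is a minimal sketch. The helper `build_xattr_map`, its inputs, and the in-memory representation are hypothetical; the actual on-disk format inside the OIRPM is not specified here.

```python
def build_xattr_map(jigdo_files, commit_xattrs):
    """Map (package, file path) -> sorted list of (xattr name, value).

    jigdo_files:   iterable of (package, path) pairs taken from the jigdo set
    commit_xattrs: dict mapping path -> {xattr name: bytes value}, as computed
                   on the server when the commit was labeled
    Both inputs are hypothetical stand-ins for this sketch.
    """
    xattr_map = {}
    for pkg, path in jigdo_files:
        attrs = commit_xattrs.get(path)
        if attrs is not None:
            # Sort so the mapping is deterministic for a given commit.
            xattr_map[(pkg, path)] = sorted(attrs.items())
    return xattr_map


xattr_map = build_xattr_map(
    [("bash", "/usr/bin/bash"), ("coreutils", "/usr/bin/ls")],
    {
        "/usr/bin/bash": {"security.selinux": b"system_u:object_r:shell_exec_t:s0"},
        "/usr/bin/ls": {"security.selinux": b"system_u:object_r:bin_t:s0"},
    },
)
print(xattr_map[("bash", "/usr/bin/bash")])
```

With such a table shipped from the server, the client can apply labels by lookup instead of evaluating the policy's regular expressions itself.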

At this point, if we add in the new objects such as the metadata objects from the OSTree commit and all new content objects that aren't part of a package, we'll have our OIRPM. (There is some further complexity around handling the initramfs and SELinux labeling that we'll omit for now.)
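Putting the pieces together, the OIRPM payload could be pictured like this. The `Oirpm` class, its field names, and `assemble_oirpm` are invented for illustration; this is a sketch under those assumptions, not the real packaging code.

```python
from dataclasses import dataclass


@dataclass
class Oirpm:
    """Hypothetical in-memory view of what an OIRPM carries."""
    metadata_objects: dict  # object name -> bytes (commit and directory metadata)
    content_objects: dict   # path -> digest, for files that no RPM provides
    xattr_map: dict         # (package, path) -> extended attributes


def assemble_oirpm(metadata_objects, commit_files, packaged_paths, xattr_map):
    """Keep only the content that cannot be recovered from the jigdo set RPMs.

    commit_files:   dict mapping path -> content digest in the commit
    packaged_paths: set of paths that the jigdo set RPMs supply unchanged
    """
    new_content = {
        path: digest
        for path, digest in commit_files.items()
        if path not in packaged_paths
    }
    return Oirpm(dict(metadata_objects), new_content, dict(xattr_map))


oirpm = assemble_oirpm(
    metadata_objects={"commit": b"...", "dirtree": b"..."},
    commit_files={"/usr/bin/bash": "aaa", "/usr/lib/modules/initramfs.img": "zzz"},
    packaged_paths={"/usr/bin/bash"},
    xattr_map={("bash", "/usr/bin/bash"): [("security.selinux", b"label")]},
)
print(sorted(oirpm.content_objects))
```

In this toy example only the initramfs ends up as a new content object, since the bash binary can be fetched as part of its RPM.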