IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Back in 2b8d586c5, /sysroot was changed to be a private mount so that
submounts of /var do not propagate back to the stateroot /var. That's
laudible, but it makes /sysroot different than every other shared mount
in the root namespace. In particular, it means that submounts of
/sysroot do not propagate into separate mount namespaces.
Rather than make /sysroot private, make /var a slave+shared mount so
that it receives mount events from /sysroot but not vice versa. That
achieves the same effect of preventing /var submount events from
propagating back to /sysroot while allowing /sysroot mount events to
propagate forward like every other system mount. See
mount_namespaces(7)[1] and the linux shared subtrees[2] documentation
for details on slave+shared mount propagation.
When /var is mounted in the initramfs, this is accomplished with
mount(2) syscalls. When /var is mounted after switching to the real
root, the mount propagation flags are applied as options in the
generated var.mount unit. This depends on a mount(8) feature that has
been present since util-linux 2.23. That's available in RHEL 7 and every
non-EOL Debian and Ubuntu release. Applying the propagation from
var.mount fixes a small race, too. Previously, if a /var submount was
added before /sysroot was made private, it would have propagated back
into /sysroot. That was possible since ostree-remount.service orders
itself after var.mount but not before any /var submounts.
1. https://man7.org/linux/man-pages/man7/mount_namespaces.7.html
2. https://docs.kernel.org/filesystems/sharedsubtree.htmlFixes: #2086
This tests the current behavior of making /sysroot a private mount so
that submounts on /var do not propagate back to /sysroot. It also shows
how submounts of /sysroot do not propagate into separate mount
namespaces for the same reason.
This builds on top of fa9924e4fe
(But in a very hacky way because we don't currently link to a JSON library)
Basically, bootupd supports injecting static configs, and this
is the currently least hacky way for us to detect this and understand
that we shouldn't try to run `grub2-mkconfig`.
A further patch I'd like to do here is also change the probing
logic to gracefully no-op if `grub2-mkconfig` doesn't exist,
but that has a bit more risk and involvement.
We've long struggled with semantics for `/var`. Our stance of
"/var should start out empty and be managed by the OS" is a strict
one, that pushes things closer to the original systemd upstream
ideal of the "OS state is in /usr".
However...well, a few things. First, we had some legacy bits
here which were always populating the deployment `/var`. I don't
think we need that if systemd is in use, so detect if the tree
has `usr/lib/tmpfiles.d`, and don't create that stuff at
`ostree admin stateroot-init` time if so.
Building on that then, we have the stateroot `var` starting out
actually empty.
When we do a deployment, if the stateroot `var` is empty,
make a copy (reflink if possible of course) of the commit's `/var`
into it.
This matches the semantics that Docker created with volumes,
and this is sufficiently simple and easy to explain that I think
it's closer to the right thing to do.
Crucially...it's just really handy to have some pre-existing
directories in `/var` in container images, because Docker (and podman/kube/etc)
don't run systemd and hence don't run `tmpfiles.d` on startup.
I really hit on the fact that we need `/var/tmp` in our container
images by default for example.
So there's still some overlap here with e.g. `/usr/lib/tmpfiles.d/var.conf`
as shipped by systemd, but that's fine - they don't actually conflict
per se.
In the OSTree model, executables go in `/usr`, state in `/var` and
configuration in `/etc`. Software that lives in `/opt` however messes
this up because it often mixes code *and* state, making it harder to
manage.
More generally, it's sometimes useful to have the OSTree commit contain
code under a certain path, but still allow that path to be writable by
software and the sysadmin at runtime (`/usr/local` is another instance).
Add the concept of state overlays. A state overlay is an overlayfs
mount whose upper directory, which contains unmanaged state, is carried
forward on top of a lower directory, containing OSTree-managed files.
In the example of `/usr/local`, OSTree commits can ship content there,
all while allowing users to e.g. add scripts in `/usr/local/bin` when
booted into that commit.
Some reconciliation logic is executed whenever the base is updated so
that newer files in the base are never shadowed by a copied up version
in the upper directory. This matches RPM semantics when upgrading
packages whose files may have been modified.
For ease of integration, this is exposed as a systemd template unit which
any downstream distro/user can enable. The instance name is the mountpath
in escaped systemd path notation (e.g.
`ostree-state-overlay@usr-local.service`).
See discussions in https://github.com/ostreedev/ostree/issues/3113 for
more details.
When we estimate how much space a new bootcsum dir will use, we
weren't accounting for the space overhead from files not using the
last filesystem block completely. This doesn't matter much if counting
a few files, but e.g. on FCOS aarch64, we include lots of small
devicetree blobs in the bootfs. That loss can add up to enough for the
`fallocate()` check to pass but copying still hitting `ENOSPC` later on.
I think a better fix here is to change approach entirely and instead
refactor `install_deployment_kernel()` so that we can call just the
copying bits of it as part of the early prune logic. We'll get a more
accurate assessment and it's not lost work since we won't need to
recopy later on. Also this would not require having to keep in sync the
estimator and the install bits.
That said, this is blocking FCOS releases, so I went with a more tactical
fix for now.
Fixes: https://github.com/coreos/fedora-coreos-tracker/issues/1637
I think we originally used to do this, but at some point in a
code refactoring, this optimization got lost.
It's a quite important optimization for the case of writing content
generated by an external system into an ostree repository.
This is a pattern we want to encourage. It's honestly just
way simpler than what rpm-ostree is doing today in auto-synthesizing
individual tmpfiles.d snippets.
It's about time we do this; deployment finalization locking
is a useful feature. An absolutely key thing here is that
we've slowly been moving towards the deployments as the primary
"source of truth".
Specifically in bootc for example, we will GC container images
not referenced by a deployment.
This is then neecessary to support a "pull but don't apply automatically" model.
This stabilizes the existing `ostree admin deploy --lock-finalization`
CLI, and adds a new `ostree admin unlock-finalization`.
We still check the old lock file path, but there's a new boolean
value as part of the staged deployment data which is intended
to be the source of truth in the future. At some point then we
can drop the rpm-ostree lockfile handling.
Closes: https://github.com/ostreedev/ostree/issues/3025
Right now `ostree admin status` errors out in this case, but
`rpm-ostree status` doesn't. The former behavior is probably
more of a bug, work around it for now.
There seems to be a tricky regression here with the util-linux
support for the new mount API, plus overlays support for it.
```
[2023-11-09T21:05:30.633Z] Nov 09 21:05:26 qemu0 kola-runext-unlock-transient.sh[2108]: + unshare -m -- /bin/sh -c 'mount -o remount,rw /usr && echo hello from transient unlock >/usr/share/writable-usr-test'
[2023-11-09T21:05:30.633Z] Nov 09 21:05:26 qemu0 kola-runext-unlock-transient.sh[2148]: mount: /usr: mount point not mounted or bad option.
[2023-11-09T21:05:30.633Z] Nov 09 21:05:26 qemu0 kola-runext-unlock-transient.sh[2148]: dmesg(1) may have more information after failed mount system call.
```
OK this seems related to the new mount API support in util-linux and overlayfs. From a strace:
```
2095 open_tree(AT_FDCWD, "/usr", OPEN_TREE_CLOEXEC) = 3
2095 mount_setattr(-1, NULL, 0, NULL, 0) = -1 EINVAL (Invalid argument)
...
2095 fspick(3, "", FSPICK_NO_AUTOMOUNT|FSPICK_EMPTY_PATH) = 4
2095 fsconfig(4, FSCONFIG_SET_FLAG, "seclabel", NULL, 0) = 0
2095 fsconfig(4, FSCONFIG_SET_STRING, "lowerdir", "usr", 0) = -1 EINVAL (Invalid argument)
```
I think the core problem here is it's trying to reconfigure the mount with existing options,
but in the new mount namespace we can't see the lowerdir.
Here we really really just want to remount writable. Telling
util-linux to not pass existing options fixes it.
This closes the biggest foot-gun when doing e.g.
`rpm-ostree rebase` when zincati is running on a FCOS system.
Previously if zincati happened to have staged + locked a deployment,
we'd keep around the lock which is definitely not what is desired.
This will be very useful for enabling a "transient /etc" option
because we won't have to do hacks relabling in the initramfs, or
forcing it on just for composefs.
The `f_bfree` member of the `statvfs` struct is documented as the
"number of free blocks". However, different filesystems have different
interpretations of this. E.g. on XFS, this is truly the number of blocks
free for allocating data. On ext4 however, it includes blocks that
are actually reserved by the filesystem and cannot be used for file
data. (Note this is separate from the distinction between `f_bfree` and
`f_bavail` which isn't relevant to us here since we're privileged.)
If a kernel and initrd is sized just right so that it's still within the
`f_bfree` limit but above what we can actually allocate, the early prune
code won't kick in since it'll think that there is enough space. So we
end up hitting `ENOSPC` when we actually copy the files in.
Rework the early prune code to instead use `fallocate` which guarantees
us that a file of a certain size can fit on the filesystem. `fallocate`
requires filesystem support, but all the filesystems we care about for
the bootfs support it (including even FAT).
(There's technically a TOCTOU race here that existed also with the
`statvfs` code where free space could change between when we check
and when we copy. Ideally we'd be able to pass down that fd to the
copying bits, but anyway in practice the bootfs is pretty much owned by
libostree and one doesn't expect concurrent writes during a finalization
operation.)
During the early design of FCOS and RHCOS, we chose a value of 384M
for the boot partition. This turned out to be too small: some arches
other than x86_64 have larger initrds, kernel binaries, or additional
artifacts (like device tree blobs). We'll likely bump the boot partition
size in the future, but we don't want to abandon all the nodes deployed
with the current size.[[1]]
Because stale entries in `/boot` are cleaned up after new entries are
written, there is a window in the update process during which the bootfs
temporarily must host all the `(kernel, initrd)` pairs for the union of
current and new deployments.
This patch determines if the bootfs is capable of holding all the
pairs. If it can't but it could hold all the pairs from just the new
deployments, the outgoing deployments (e.g. rollbacks) are deleted
*before* new deployments are written. This is done by updating the
bootloader in two steps to maintain atomicity.
Since this is a lot of new logic in an important section of the
code, this feature is gated for now behind an environment variable
(`OSTREE_ENABLE_AUTO_EARLY_PRUNE`). Once we gain more experience with
it, we can consider turning it on by default.
This strategy increases the fallibility of the update system since one
would no longer be able to rollback to the previous deployment if a bug
is present in the bootloader update logic after auto-pruning (see [[2]]
and following). This is however mitigated by the fact that the heuristic
is opportunistic: the rollback is pruned *only if* it's the only way for
the system to update.
[1]: https://github.com/coreos/fedora-coreos-tracker/issues/1247
[2]: https://github.com/ostreedev/ostree/issues/2670#issuecomment-1179341883Closes: #2670
When hacking and testing locally with `cosa build-fast` and `kola run`,
I prefer to leave testing framework stuff within the work directory
rather than installed in my pet container. Add a `localinstall` target
for this which puts the tests in `tests/kola`. Then a simple `kola run`
will pick it up.
XFS now seems to want filesystems larger than 300MB, so switch
to ext4. Also use `20MiB` so we align to 512b sectors to squash
a `losetup` warning.
Also tweak some of the numbers to still work.
Introduces an intermediate format for overlayfs storage, where
.wh-ostree. prefixed files will be converted into char 0:0
whiteout devices used by overlayfs to mark deletions across layers.
The CI scripts now uses a volume for the scratch directories
previously in /var/tmp otherwise we cannot create whiteout
devices into an overlayfs mounted filesystem.
Related-Issue: #2712
If `/boot` is an automount, then the unit will be stopped as soon as the
automount expires. That's would defeat the purpose of using systemd to
delay finalizing the deployment until shutdown. This is not uncommon as
`systemd-gpt-auto-generator` will create an automount unit for `/boot`
when it's the EFI System Partition and there's no fstab entry.
To ensure that systemd doesn't stop the service early when the `/boot`
automount expires, introduce a new unit that holds `/boot` open until
it's sent `SIGTERM`. This uses a new `--hold` option for
`finalize-staged` that loads but doesn't lock the sysroot. A separate
unit is used since we want the process to remain active throughout the
finalization run in `ExecStop`. That wouldn't work if it was specified
in `ExecStart` in the same unit since it would be killed before the
`ExecStop` action was run.
Fixes: #2543
are pending
This is to support pending deployments instead of rasing assertion.
For example:
```
$ sudo rpm-ostree kargs --append=foo=bar
$ sudo ostree admin kargs edit-in-place --append-if-missing=foobar
```
After reboot we get both `foo=bar foobar`.
Fix https://github.com/ostreedev/ostree/issues/2679
Don't parse `rpm-ostree status` output, it's not meant for that. Use
`--json` output instead.
While we're here, fix an obsolete reference to Ansible.
Related: https://github.com/coreos/rpm-ostree/pull/3938
https://github.com/coreos/coreos-assembler/pull/2921 broke this
test which is intentionally causing a systemd unit to fail.
As they say, necessity is the mother of invention. They don't
say though that need always causes particularly *beautiful* things
to be invented...
Quite a while ago we added staged deployments, which solved
a bunch of issues around the `/etc` merge. However...a persistent
problem since then is that any failures in that process that
happened in the *previous* boot are not very visible.
We ship custom code in `rpm-ostree status` to query the previous
journal. But that has a few problems - one is that on systems
that have been up a while, that failure message may even get
rotated out. And second, some systems may not even have a persistent
journal at all.
A general thing we do in e.g. Fedora CoreOS testing is to check
for systemd unit failures. We do that both in our automated tests,
and we even ship code that displays them on ssh logins. And beyond
that obviously a lot of other projects do the same; it's easy via
`systemctl --failed`.
So to make failures more visible, change our `ostree-finalize-staged.service`
to have an internal wrapper around the process that "catches" any
errors, and copies the error message into a file in `/boot/ostree`.
Then, a new `ostree-boot-complete.service` looks for this file on
startup and re-emits the error message, and fails.
It also deletes the file. The rationale is to avoid *continually*
warning. For example we need to handle the case when an upgrade
process creates a new staged deployment. Now, we could change the
ostree core code to delete the warning file when that happens instead,
but this is trying to be a conservative change.
This should make failures here much more visible as is.
`kola` now follows symlinks when archiving an external test's `data/`
dir. So the recursive `data` symlink we have here breaks it.
Let's just move the shared files in its own directory and update the
symlinks.
Support for that file was added previously, but the testing lived in
rpm-ostree only. Let's add it here too.
In the process add a hidden `--lock-finalization` to `ostree admin
deploy` to make testing easier (though it could also be useful to update
managers driving OSTree via the CLI).
In fixing https://github.com/coreos/rpm-ostree/pull/3323
I felt that it was a bit ugly we're installing `/usr/bin/ostree-container`.
It's kind of an implementation detail. We want users to use
`ostree container`.
Let's support values outside of $PATH too.
For example, this also ensures that TAB completion for `ost` expands
to `ostree ` with a space.
This reworks the var-mount destructive test in order to properly use
the datadir for the current stateroot instead of a duplicated one.
In turn, it ensures that the resulting `var.mount` after reboot is
correctly pointing to the same location which hosted `/var` on the
previous boot.
https://bugzilla.redhat.com/show_bug.cgi?id=1945274 is an issue where a privileged
kubernetes daemonset is writing a socket into `/etc`. This makes ostree upgrades barf.
Now, they should clearly move it to `/run`. However, one option is for us to
just ignore it instead of erroring out. Some brief investigation shows that
e.g. `git add somesocket` is a silent no-op, which is an argument in favor of ignoring it.
Closes: https://github.com/ostreedev/ostree/issues/2446
The logic for `--selinux-policy` ended up in the `--tree=dir`
path, but there's no reason for that. Fix the imported
labeling with `--tree=tar`. Prep for use with containers.
We had this bug because the previous logic was trying to avoid
duplicating the code for generic `--selinux-policy` and
the case of `--selinux-policy-from-base --tree=dir`.
It's a bit more code, but it's cleaner if we dis-entangle them.
Having to touch a global test counter when adding tests is
a recipe for conflicts between PRs.
The TAP protocol allows *ending* with the expected number of
tests, so the best way to do this is to have an explicit
API like our `tap_ok` which bumps a counter, then end with `tap_end`.
I ported one test as a demo.
We're waaay overdue for this, it's been the default
in rpm-ostree for years, and solves several important bugs
around not capturing `/etc` while things are running.
Also, `ostree admin upgrade --stage` (should) become idempotent.
Closes: https://github.com/ostreedev/ostree/issues/2389
If we fail as a result of `set -x`, It's often not completely obvious
which command failed or how. Use a trap on ERR to show the command that
failed, and its exit status.
Signed-off-by: Simon McVittie <smcv@collabora.com>
[Originally from bubblewrap commits c5c999a7 "tests: test --userns"
and 3e5fe1bf "tests: Better error message if assert_files_equal fails";
separated into this commit by Simon McVittie.]
We struggled for a long time with enablement of our "internal units",
trying to follow the philosophy that units should only be enabled
by explicit preset.
See https://bugzilla.redhat.com/show_bug.cgi?id=1451458
and https://github.com/coreos/rpm-ostree/pull/1482
etc.
And I just saw chat (RH internal on a proprietary system sadly) where
someone hit `ostree-remount.service` not being enabled in CentOS8.
Thinking about this more, I realized we've shipped a systemd generator
for a long time and while its only role until now was to generate `var.mount`,
but by using it to force on our internal units, we don't require
people to deal with presets anymore.
Basically we're inverting things so that "if ostree= is on the kernel
cmdline, then enable our units" and not "enable our units, but have
them use ConditionKernelCmdline=ostree to skip".
Drop the weird gyrations we were doing around `ostree-finalize-staged.path`
too; forking `systemctl start` is just asking for bugs.
So after this, hopefully we won't ever again have to think about
distribution presets and our units.
This will be ignored, so let's make it very clear
people are doing something wrong. Motivated by a bug
in a build pipeline that injected `/var/lib/rpm` into an ostree
commit which ended up crashing rpm-ostree because it was an empty db
which it wasn't expecting.
It *also* turns out rpm-ostree is incorrectly dumping content in the
deployment `/var` today, which is another bug.