Fixes #3951
Bootkube support was removed in Talos 0.9. Talos versions 0.9-0.11
support conversion of self-hosted bootkube-based control plane to the
new style control plane running as static pods managed by Talos.
This commit removes all backwards compatibility and conversion code.
For the k8s controllers, `BootstrapStatus` is removed and a dependency
on `etcd` service status is added (as it was implicitly there via
`BootstrapStatus`).
Control plane conversion code is removed, along with the self-hosted path
in the k8s upgrade code.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This makes it possible to keep track of when the resource was created and
updated.
The update timestamp is tied to the version bump.
Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
This PR can be split into two parts:
* controllers
* apid binding into COSI world
Controllers
-----------
* `k8s.EndpointController` provides control plane endpoints on worker
nodes (for now it isn't required on control plane nodes)
* `secrets.RootController` now provides OS top-level secrets (CA cert)
and secret configuration
* `secrets.APIController` generates API secrets (certificates) differently
for workers and control plane nodes: control plane nodes generate them
directly, while workers reach out to `trustd` on control plane nodes via
the `k8s.Endpoint` resource
apid Binding
------------
The `secrets.API` resource provides a protobuf binding by converting
itself to and from the protobuf spec.
apid no longer receives the machine configuration; instead, it receives a
gRPC-backed socket to access the Resource API. apid watches the
`secrets.API` resource, fetches the certs and CA from it, and uses them in
its TLS configuration.
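A minimal sketch of that idea, with illustrative types rather than the
actual apid code: the watch loop stores the latest certificate, and the
standard library's `tls.Config` callback hands it out per handshake, so a
rotation picked up from the `secrets.API` watch takes effect without
restarting the server.

```go
package apidtls

import (
	"crypto/tls"
	"sync"
)

// certProvider holds the latest serving certificate fetched from the
// watched secrets.API resource (hypothetical plumbing, for illustration).
type certProvider struct {
	mu   sync.RWMutex
	cert *tls.Certificate
}

// Update is called from the resource watch loop whenever secrets.API changes.
func (p *certProvider) Update(certPEM, keyPEM []byte) error {
	cert, err := tls.X509KeyPair(certPEM, keyPEM)
	if err != nil {
		return err
	}

	p.mu.Lock()
	p.cert = &cert
	p.mu.Unlock()

	return nil
}

// TLSConfig returns a tls.Config which always serves the current certificate.
func (p *certProvider) TLSConfig() *tls.Config {
	return &tls.Config{
		GetCertificate: func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
			p.mu.RLock()
			defer p.mu.RUnlock()

			return p.cert, nil
		},
	}
}
```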
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
These seem to be completely unused, and they look scary enough at the
same time.
For better readability and to avoid any pitfalls, it is better to remove them.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Sometimes `talosctl etcd snapshot` might not be available, for example
when etcd is not healthy. In that case it's possible to copy the raw etcd
data directory with `talosctl cp /var/lib/etcd .` and use
`member/snap/db` to recover the cluster. But such a copy won't pass
integrity checks, so they have to be disabled explicitly.
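For reference, this is roughly what a restore with the hash check disabled
looks like, assuming etcd v3.4's `clientv3/snapshot` package (names, paths,
and URLs here are illustrative):

```go
package main

import (
	"log"

	"go.etcd.io/etcd/clientv3/snapshot"
	"go.uber.org/zap"
)

func main() {
	m := snapshot.NewV3(zap.NewExample())

	// member/snap/db copied via `talosctl cp` is not a proper snapshot,
	// so the integrity (hash) check has to be skipped explicitly.
	if err := m.Restore(snapshot.RestoreConfig{
		SnapshotPath:        "member/snap/db",
		Name:                "controlplane-1",
		OutputDataDir:       "/var/lib/etcd",
		PeerURLs:            []string{"https://10.0.0.1:2380"},
		InitialCluster:      "controlplane-1=https://10.0.0.1:2380",
		InitialClusterToken: "etcd-cluster",
		SkipHashCheck:       true,
	}); err != nil {
		log.Fatal(err)
	}
}
```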
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
When a Talos `controlplane` node is waiting for a bootstrap, the `etcd`
contents can be recovered from a snapshot created with
`talosctl etcd snapshot` on a healthy cluster.
The bootstrap process goes the same way as before, but the etcd data
directory is recovered from the snapshot.
This flow enables disaster recovery for the control plane: given that
periodic backups are available, destroy control plane nodes, re-create
them with the same config, and bootstrap one node with the saved
snapshot to recover etcd state at the time of the snapshot.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This is mostly refactoring to adapt to the new APIs.
There are some small changes which are not user-visible immediately (but
visible when using `talosctl get` to inspect low-level details):
* the `extras` namespace is removed; it was a hack to distinguish extra and
system manifests
* `Manifests` are managed by two controllers as shared outputs and are
stored in the `controlplane` namespace now
* `talosctl inspect dependencies` output has changed slightly
* resources now have `md.owner` set to the name of the controller which
manages the resource
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This adds a simple API and a `talosctl etcd snapshot` command to stream a
snapshot of etcd from one of the control plane nodes to a local file.
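The receiving side is a straightforward stream-to-file copy; a sketch with
a hypothetical chunk stream interface standing in for the gRPC client
stream:

```go
package snapshotclient

import (
	"io"
	"os"
)

// chunkStream stands in for the gRPC server-streaming client: each Recv
// returns the next chunk of the snapshot (hypothetical interface).
type chunkStream interface {
	Recv() ([]byte, error) // returns io.EOF when the snapshot is complete
}

// saveSnapshot streams etcd snapshot chunks into a local file.
func saveSnapshot(stream chunkStream, path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	for {
		chunk, err := stream.Recv()
		if err == io.EOF {
			return f.Sync() // flush to disk before declaring success
		}
		if err != nil {
			return err
		}
		if _, err := f.Write(chunk); err != nil {
			return err
		}
	}
}
```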
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Fixes: https://github.com/talos-systems/talos/issues/3323
These do not exactly match the udevd-generated `by-<id>` symlinks, but
they should provide a sufficient set of property selectors to pick
specific disks of any kind: SD card, HDD, SSD, NVMe.
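The mechanics can be illustrated with a toy matcher (the property names
here are examples, not the exact Talos selector schema): a selector is a
set of disk properties, and a disk matches when every set property agrees.

```go
package diskselect

// Disk describes properties discovered for a block device
// (illustrative subset).
type Disk struct {
	Model  string
	Serial string
	Type   string // "sd", "hdd", "ssd", "nvme"
	Size   uint64
}

// Selector picks disks by property; empty fields are wildcards.
type Selector struct {
	Model string
	Type  string
}

// Matches reports whether the disk satisfies every set property.
func (s Selector) Matches(d Disk) bool {
	if s.Model != "" && s.Model != d.Model {
		return false
	}
	if s.Type != "" && s.Type != d.Type {
		return false
	}

	return true
}
```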
Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
Fixes: https://github.com/talos-systems/talos/issues/3219
We already have `etcd leave`, which makes the node exclude itself from
etcd members.
But in case the node can't remove itself because it has no connection to
etcd, we need this `etcd remove-member` CLI, which removes a node from the
member list using a different node.
No unit tests for that, as it would destroy the test cluster.
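For illustration, the equivalent member removal with the plain etcd Go
client (etcd v3.4 import paths; the endpoint and member name are made up,
and the Talos command performs this through its own API):

```go
package main

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://10.0.0.2:2379"}, // a healthy member
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Find the broken member by name and remove it by ID.
	resp, err := cli.MemberList(ctx)
	if err != nil {
		log.Fatal(err)
	}

	for _, m := range resp.Members {
		if m.Name == "broken-node" {
			if _, err := cli.MemberRemove(ctx, m.ID); err != nil {
				log.Fatal(err)
			}
		}
	}
}
```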
Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
This is required to upgrade from Talos 0.8.x to 0.9.x. After the cluster
is fully upgraded, the control plane is still self-hosted (as it was
bootstrapped with bootkube).
The `talosctl convert-k8s` tool (and the library behind it) performs the
conversion from the self-hosted control plane to the new static-pod-based
one.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This uses the API in `os-runtime` to pull the initial list of resources
plus updates for resources by type.
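The access pattern is the usual list-then-watch; a hypothetical sketch
(the real `os-runtime` client API differs in its details):

```go
package resourceview

import "context"

// Event is a resource change notification (illustrative).
type Event struct {
	Resource interface{}
}

// Client is a stand-in for the os-runtime resource API client.
type Client interface {
	List(ctx context.Context, resourceType string) ([]interface{}, error)
	Watch(ctx context.Context, resourceType string) (<-chan Event, error)
}

// follow pulls the initial list for a type, then applies updates as they come.
func follow(ctx context.Context, c Client, resourceType string, apply func(interface{})) error {
	items, err := c.List(ctx, resourceType)
	if err != nil {
		return err
	}

	for _, item := range items {
		apply(item)
	}

	events, err := c.Watch(ctx, resourceType)
	if err != nil {
		return err
	}

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case ev := <-events:
			apply(ev.Resource)
		}
	}
}
```

A production implementation also has to bridge the gap between the list
and the start of the watch so no update is missed in between.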
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This explains the intention better: the config is applied on reboot, which
makes it easy to distinguish from `apply-config --immediate`, which
applies the config immediately without a reboot (that is coming in a
different PR).
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Also fix the recovery gRPC handler to print the panic stacktrace to the log.
Any API should follow a structure compatible with apid's proxying and
injection of errors/nodes.
Explicitly fail the GenerateConfig API on worker nodes, as it currently
panics there (missing certificates in the node config).
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Our upgrades are safe by default: we check etcd health, take locks,
etc. But sometimes an upgrade might be a way to recover a broken (or
semi-broken) cluster; in that case we need the upgrade to run even if the
checks are not passing. This is not a safe way to do upgrades, but it
might be a way to recover a cluster.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
There are several ways a Talos node might be restarted or shut down:
* error in sequence (initiated from machined)
* panic in main goroutine (machined recovers panics)
* error in sequence (initiated via API, event caught by machined)
* reboot/shutdown via Talos API
Before this change, paths (1) and (2) were handled in machined, and no
disks were unmounted and no processes were killed, so technically all the
processes were still running and potentially writing to the filesystems.
Paths (3) and (4) tried to stop services (but not pods) and unmount
explicitly mounted filesystems, followed by a reboot directly from the
sequencer (bypassing the machined handler).
There was a bug that user disks were never explicitly unmounted (but
they might have been unmounted if mounted on top of `/var`).
This refactors all the reboot/shutdown paths to flow through machined's
main function: on path (4) an event is sent via the event API from the
sequencer back to machined, and machined initiates the proper shutdown
sequence.
Refactoring in machined leads to all the paths (1)-(4) flowing through
the same function `handle(error)`.
Two additional steps were added before flushing the buffers:
* kill all non-system processes; this also kills all mount namespaces
* unmount any filesystem backed by `/dev/*`
This ensures all filesystems are unmounted before buffers are flushed.
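A simplified sketch of that final unmount pass on Linux, using
`golang.org/x/sys/unix` (the real sequence also needs retries and more
error tolerance):

```go
package shutdown

import (
	"bufio"
	"os"
	"sort"
	"strings"

	"golang.org/x/sys/unix"
)

// unmountDevFilesystems unmounts every mount backed by a /dev/* device,
// deepest mountpoints first, so nothing is left writing before buffers
// are flushed.
func unmountDevFilesystems() error {
	f, err := os.Open("/proc/mounts")
	if err != nil {
		return err
	}
	defer f.Close()

	var targets []string

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// /proc/mounts format: device mountpoint fstype options dump pass
		fields := strings.Fields(scanner.Text())
		if len(fields) >= 2 && strings.HasPrefix(fields[0], "/dev/") {
			targets = append(targets, fields[1])
		}
	}
	if err := scanner.Err(); err != nil {
		return err
	}

	// Unmount nested mounts before their parents (path length as a
	// simple proxy for mount depth).
	sort.Slice(targets, func(i, j int) bool { return len(targets[i]) > len(targets[j]) })

	for _, target := range targets {
		if err := unix.Unmount(target, 0); err != nil {
			return err
		}
	}

	return nil
}
```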
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Control plane components now run as static pods managed by the
kubelets.
The whole subsystem is managed via resources/controllers from os-runtime.
Many supporting changes/refactorings enable the new code paths.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This brings in the `os-runtime` package and exposes resources with the
first iteration of a read-only API.
Two Talos resources (and one controller) are implemented:
* legacy.Service resource tracks Talos 'service' `RUNNING` state
* config.V1Alpha1 stores current runtime config
The glue point between the existing runtime and the new os-runtime based
runtime is the `v1alpha2` implementation and the `V1Alpha2()`
sub-interfaces of the existing `Runtime`, `State`, and `Controller`
interfaces.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Before these changes, errors were always sent as strings, so if the
original error was a gRPC error (which is almost always the case for
apid), it was formatted as a string, and the original fields (like the
code) were lost in the formatted string.
With this change, apid sends errors as the official `grpc.Status` protobuf
structure, and the client decodes that into a Go `grpc.Status`-based error.
This change is backwards and forwards compatible.
This should fix more cases where integration tests were not able to
ignore gRPC `transport is closing` errors when they were sent as strings
from the apid endpoint.
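On the client side, the standard `status` package can then recover the
structured error, so tests can match on the code instead of parsing
strings; for example:

```go
package main

import (
	"errors"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isTransportClosing reports whether err is the gRPC "transport is closing"
// class of error, using the structured code instead of string matching.
func isTransportClosing(err error) bool {
	st, ok := status.FromError(err)
	if !ok {
		return false // not a gRPC status error
	}

	return st.Code() == codes.Unavailable
}

func main() {
	err := status.Error(codes.Unavailable, "transport is closing")

	fmt.Println(isTransportClosing(err))                       // true
	fmt.Println(isTransportClosing(errors.New("plain error"))) // false
}
```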
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
The idea is to add an option to perform a "selective" reset: the default
reset operation wipes all partitions (triggering a reinstall), while the
spec allows wiping only some of the partitions.
Other operations are performed in exactly the same way for any reset
flow.
A possible use case: reset only the `EPHEMERAL` partition.
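Conceptually, the spec is just a list of partition labels to wipe; a
hypothetical shape, not the actual Talos API types:

```go
package reset

// PartitionTarget names a system partition by its label.
type PartitionTarget string

const (
	Ephemeral PartitionTarget = "EPHEMERAL"
	State     PartitionTarget = "STATE"
)

// Spec describes a selective reset: only the listed partitions are wiped;
// an empty list means the default behavior of wiping everything.
type Spec struct {
	Wipe []PartitionTarget
}
```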
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
The regular upgrade path takes just one reboot, but it requires all the
processes to be stopped on the node before the upgrade can proceed. Under
some circumstances, and with potential Talos bugs, this might not work,
rendering Talos upgrades almost impossible.
Staged upgrades build upon the regular install flow to run the upgrade on
the node reboot. Such upgrades require two reboots of the node and two
pulls of the installer image, but they should be much less susceptible to
failure. Once the upgrade is staged, the node can be rebooted in any
possible way, including a hard reset, and the upgrade is performed on the
next boot.
A new ADV format was implemented as well to allow storing the install
image ref/options across reboots. The new format allows for bigger values
and takes 50% of the `META` partition. The old ADV is still kept for
compatibility reasons.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
If the node time is out of sync, it can generate an incorrect
configuration. And maintenance mode does not allow us to start NTP,
because there is no containerd.
By providing the current UTC time of the machine where the talosctl
client is running, it is possible to force GenerateConfiguration to use
the correct time.
Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
Initial version, which only allows setting the CNI using a preset; no
custom CNI URLs are supported at the moment. We still need to figure out
what kind of UI can be used for that.
Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
This allows config to be written to disk without being applied
immediately.
Small refactoring to extract common code paths.
At first, I tried to implement this via the sequencer, but it looks like
it's too hard to get right, as the sequencer lacks context, and the
config to be written is not applied to the runtime.
Fixes #2828
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Fixes: https://github.com/talos-systems/talos/issues/2766
This API is implemented in the Maintenance and Machine services.
It can be used to generate the configuration on the node, instead of
using talosctl to generate it locally.
It is to be used in the interactive installer and `talosctl gen config`.
Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
Now the maintenance service implements the `MachineService` interface,
stubbing out all unimplemented methods.
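The stubs can simply return the canonical gRPC `Unimplemented` status; a
sketch of the pattern, with placeholder request/response types standing in
for the real machine API messages:

```go
package maintenance

import (
	"context"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// UpgradeRequest/UpgradeResponse stand in for the real machine API types.
type (
	UpgradeRequest  struct{}
	UpgradeResponse struct{}
)

// Server stubs out MachineService methods which make no sense in
// maintenance mode.
type Server struct{}

// Upgrade is not available before the node is configured.
func (s *Server) Upgrade(ctx context.Context, in *UpgradeRequest) (*UpgradeResponse, error) {
	return nil, status.Error(codes.Unimplemented, "Upgrade is not implemented in maintenance mode")
}
```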
Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
Fixes #2761
The `osd` service was merged into machined on July 13th, before the 0.6
release. It's time to drop backwards compatibility with pre-0.6 clients.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Instead of hosting a web service, we decided to implement a gRPC service
that exposes APIs that can be used in a client-side interactive installer.
Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
Adds the ability to apply (replace) an existing node configuration with
a new one via the Machine API.
Fixes #2345
Signed-off-by: Seán C McCord <ulexus@gmail.com>
This moves `pkg/config`, `pkg/client`, and `pkg/constants`
under the `pkg/machinery` umbrella.
`pkg/machinery` is published as a Go module inside the Talos repository.
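After the move, external consumers can depend on the lightweight module
alone; for example (assuming one of the constants exported from
`pkg/machinery/constants`):

```go
package main

import (
	"fmt"

	"github.com/talos-systems/talos/pkg/machinery/constants"
)

func main() {
	// Constants now come from the machinery module, without pulling
	// in the whole Talos codebase.
	fmt.Println(constants.DefaultDNSDomain)
}
```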
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This prevents reboots when some action triggers a sequence while another
sequence is still running.
Fixes #2209
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This builds a simple CLI UI for Talos cluster monitoring.
Some new APIs were added for monitoring, based on the Prometheus procfs
package.
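For example, node load average can be read along these lines with
`github.com/prometheus/procfs` (a sketch of the kind of read the new APIs
are built on):

```go
package main

import (
	"fmt"
	"log"

	"github.com/prometheus/procfs"
)

func main() {
	fs, err := procfs.NewFS("/proc")
	if err != nil {
		log.Fatal(err)
	}

	avg, err := fs.LoadAvg()
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("load average: %.2f %.2f %.2f\n", avg.Load1, avg.Load5, avg.Load15)
}
```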
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>