This will be useful for debugging process access rights once we start implementing SELinux.
Signed-off-by: Dmitry Sharshakov <dmitry.sharshakov@siderolabs.com>
The core change is moving the context out of the `ServiceRunner` struct
to be a local variable, and using a channel to notify about shutdown
events.
Add more synchronization between Run and the moment the service starts,
to avoid misidentifying a service that is not running yet as one that
finished successfully.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Co-authored-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
This is a small quality of life improvement that allows `logs` subcommand to suggest all available logs.
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
It was deprecated 16 months ago, time to clean up.
(This is to prepare for the first v1.7 release)
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Fixes #6391
Implement a set of APIs and commands to manage images in the CRI, and
pre-pull images on Kubernetes upgrades.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Allow specifying the reboot mode during upgrades by introducing `--reboot-mode` flag, similar to the `--mode` flag of the reboot command.
Closes siderolabs/talos#7302.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
`talosctl netstat -k` shows sockets/connections for the host and all
non-hostnetwork pods.
`talosctl netstat namespace/pod` shows sockets/connections of a specific
pod and supports shell autocompletion.
Signed-off-by: Nico Berlee <nico.berlee@on2it.net>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This allows writing keys to the META partition.
META contents can be viewed with `talosctl get metakeys`.
There is no real use case for it yet, but the next PRs will introduce
two special keys which can be written:
* platform network config for `metal`
* `${code}` variable
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes: https://github.com/siderolabs/talos/issues/6815
Additionally, make it possible to run reset in maintenance mode, which
provides a way to reset the system disk and remove all traces of Talos
from it.
The new reset flow works in a separate sequence, and the disk probe
lookup now checks the boot partition instead of the ephemeral one.
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
This allows safe recovery from out-of-space quota issues, and
defragmentation as needed.
`talosctl etcd status` command provides lots of information about the
cluster health.
See docs for more details.
Fixes #4889
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
We add a controller that provides the etcd member id as a resource
and change the etcd related commands to support member ids next to
hostnames.
Fixes: #6223
Signed-off-by: Philipp Sauter <philipp.sauter@siderolabs.com>
There's a cyclic dependency on siderolink library which imports talos
machinery back. We will fix that after we get talos pushed under a new
name.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Add a new field `actorID` to the events and populate it with a UUID for the lifecycle actions `reboot`, `reset`, `upgrade` and `shutdown`. This actor ID will be present on all events emitted by this triggered action. We can use this ID later on the client side to be able to track triggered actions.
We also emit an event with an empty payload on the event-streaming gRPC endpoint when a client connects. This event signals to the client that event streaming has actually started.
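A rough client-side sketch of how such an ID could be used is shown below. It is an assumption-laden illustration, not the Talos client API: the `event` struct, its fields and the `eventsFor` helper are hypothetical. The first empty-payload event is treated as the "stream started" signal, and later events are filtered by actor ID.

```go
package main

import "fmt"

// event is a hypothetical, simplified stand-in for a streamed event.
type event struct {
	ActorID string
	Payload string // empty for the initial "stream started" event
}

// eventsFor reports whether the stream signaled readiness (an event with an
// empty payload) and returns the payloads of the events carrying actorID.
func eventsFor(stream []event, actorID string) (ready bool, payloads []string) {
	for _, ev := range stream {
		if ev.Payload == "" {
			ready = true // the server confirmed that streaming has started
			continue
		}

		if ev.ActorID == actorID {
			payloads = append(payloads, ev.Payload)
		}
	}

	return ready, payloads
}

func main() {
	stream := []event{
		{}, // emitted by the server when the client connects
		{ActorID: "actor-1", Payload: "reboot: started"},
		{ActorID: "actor-2", Payload: "upgrade: started"},
		{ActorID: "actor-1", Payload: "reboot: finished"},
	}

	ready, payloads := eventsFor(stream, "actor-1")
	fmt.Println(ready, payloads)
}
```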
Server-side part of siderolabs/talos#5499.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
This uses the `go-packet` library with native bindings for the packet
capture (without `libpcap`). This is not the most performant way, but it
allows us to avoid CGo.
There is a problem with converting network filter expressions (like
`tcp port 3222`) into BPF instructions: this is only available in C
libraries, but there's a workaround with `tcpdump`.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
The new mode allows changing the config for a period of time, which
makes it possible to try a configuration and automatically roll it back,
for example if it doesn't work.
The mode can only be used with changes that can be applied without a
reboot.
In this mode the configuration is not written to disk, it is only
changed in memory.
The `--timeout` parameter can be used to customize the rollback delay.
The default timeout is 1 minute.
Any subsequent configuration change will abort try mode and the last
applied configuration will be used.
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Dry run prints the config diff and the selected application mode without
changing the configuration.
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
They were discovered when we tagged version 1.0.0:
* wrong deprecated version
* incompatibility in extension compatibility checks
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Cordon & drain a node when the Shutdown message is received.
Also adds a '--force' option to the shutdown command in case the control
plane is unresponsive.
Signed-off-by: Tim Jones <timniverse@gmail.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This adds information about file ownership in the long listing which is
crucial sometimes.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes #3714
This provides a safer way to join new members to the etcd cluster.
See https://etcd.io/docs/v3.4/learning/design-learner/
With learner mode join there are a few differences:
* new nodes are joined one by one, because etcd enforces a single
  learner member in the cluster
* learner members are not counted in quorum calculations, so while
  the learner catches up with the leader, quorum is not affected and
  the cluster is still operational
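The quorum point can be checked with a small calculation. The sketch below uses a hypothetical `member` type and `quorum` helper (not etcd's API) and only assumes the usual majority rule over voting members.

```go
package main

import "fmt"

// member is a hypothetical, simplified view of an etcd cluster member.
type member struct {
	Name      string
	IsLearner bool
}

// quorum returns the number of votes needed for a majority.
// Learners are non-voting, so they are excluded from the count.
func quorum(members []member) int {
	voters := 0

	for _, m := range members {
		if !m.IsLearner {
			voters++
		}
	}

	return voters/2 + 1
}

func main() {
	cluster := []member{
		{Name: "cp-1"},
		{Name: "cp-2"},
		{Name: "cp-3"},
		{Name: "cp-4", IsLearner: true}, // joining node, still catching up
	}

	fmt.Println("quorum while cp-4 is a learner:", quorum(cluster))

	cluster[3].IsLearner = false // promoted to a voting member

	fmt.Println("quorum after promotion:", quorum(cluster))
}
```

While the fourth node is a learner, the quorum stays at 2 of 3 voters; only after promotion does it rise to 3 of 4.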
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes #3951
Bootkube support was removed in Talos 0.9. Talos versions 0.9-0.11
support conversion of self-hosted bootkube-based control plane to the
new style control plane running as static pods managed by Talos.
This commit removes all backwards compatibility and removes conversion
code.
For the k8s controllers, `BootstrapStatus` is removed and a dependency
on `etcd` service status is added (as it was implicitly there via
`BootstrapStatus`).
Remove control plane conversion code.
In k8s upgrade code, remove self-hosted part.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Sometimes `talosctl etcd snapshot` might not be available, for example
when etcd is not healthy. In that case it's possible to copy the raw
etcd data directory with `talosctl cp /var/lib/etcd .` and use
`member/snap/db` to recover the cluster. But such a copy won't pass
integrity checks, so they should be disabled explicitly.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
When Talos `controlplane` node is waiting for a bootstrap, `etcd`
contents can be recovered from a snapshot created with
`talosctl etcd snapshot` on a healthy cluster.
The bootstrap process goes the same way as before, but the etcd data
directory is recovered from the snapshot.
This flow enables disaster recovery for the control plane: given that
periodic backups are available, destroy control plane nodes, re-create
them with the same config, and bootstrap one node with the saved
snapshot to recover etcd state at the time of the snapshot.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This adds a simple API and a `talosctl etcd snapshot` command to stream
a snapshot of etcd from one of the control plane nodes to a local file.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Fixes: https://github.com/talos-systems/talos/issues/3219
We already have `etcd leave`, which makes the node exclude itself from
etcd members.
But if the node can't remove itself because it has no connection to
etcd, we need this `etcd remove-member` CLI command, which removes the
node from the cluster via a different node.
There are no unit tests for this, as it would destroy the test cluster.
Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
This is required to upgrade from Talos 0.8.x to 0.9.x. After the cluster
is fully upgraded, control plane is still self-hosted (as it was
bootstrapped with bootkube).
The `talosctl convert-k8s` tool (and the library behind it) performs the
conversion from the self-hosted control plane to the new one.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This explains the intention better: config is applied on reboot. It
also makes it easy to distinguish from `apply-config --immediate`, which
applies config immediately without a reboot (that is coming in a
different PR).
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Also fix recovery grpc handler to print panic stacktrace to the log.
Any API should follow the structure compatible with apid proxying
injection of errors/nodes.
Explicitly fail GenerateConfig API on worker nodes, as it panics on
worker nodes (missing certificates in node config).
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Our upgrades are safe by default: we check etcd health, take locks,
etc. But sometimes an upgrade might be a way to recover a broken (or
semi-broken) cluster, and in that case we need the upgrade to run even
if the checks are not passing. This is not a safe way to do upgrades,
but it might be a way to recover a cluster.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
There are several ways a Talos node might be restarted or shut down:
1. error in sequence (initiated from machined)
2. panic in main goroutine (machined recovers panics)
3. error in sequence (initiated via API, event caught by machined)
4. reboot/shutdown via Talos API
Before this change, paths (1) and (2) were handled in machined, and no
disks were unmounted and no processes were killed, so technically all
the processes were still running and potentially writing to the
filesystems.
Paths (3) and (4) tried to stop services (but not pods) and unmount
explicitly mounted filesystems, followed by a reboot directly from the
sequencer (bypassing the machined handler).
There was a bug that user disks were never explicitly unmounted (but
they might have been unmounted if mounted on top of `/var`).
This refactors all the reboot/shutdown paths to flow through machined's
main function: on path (4), an event is sent via the event API from the
sequencer back to machined, and machined initiates the proper shutdown
sequence.
Refactoring in machined leads to all the paths (1)-(4) flowing through
the same function `handle(error)`.
Added two additional checks before flushing buffers:
* kill all non-system processes, this also kills all mount namespaces
* unmount any filesystem backed by `/dev/*`
This ensures all filesystems are unmounted before buffers are flushed.
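The "single exit path" idea can be sketched as below. Only the name `handle(error)` comes from the text above; its return value, the `runMain` helper and the step strings are hypothetical and merely record the order of operations instead of performing them.

```go
package main

import (
	"errors"
	"fmt"
)

// runMain runs fn and converts a panic into an error, so that both
// errors and panics end up in the same handler.
func runMain(fn func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("panic: %v", r)
		}
	}()

	return fn()
}

// handle is the single shutdown path. In this sketch it only returns the
// ordered list of steps it would perform.
func handle(err error) []string {
	var steps []string

	if err != nil {
		steps = append(steps, "log error: "+err.Error())
	}

	return append(steps,
		"kill all non-system processes",
		"unmount filesystems backed by /dev/*",
		"flush buffers",
	)
}

func main() {
	// An error in a sequence and a panic in the main goroutine both flow
	// through handle.
	for _, fn := range []func() error{
		func() error { return errors.New("sequence failed") },
		func() error { panic("boom") },
	} {
		for _, step := range handle(runMain(fn)) {
			fmt.Println(step)
		}
	}
}
```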
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>