100 Commits

Author SHA1 Message Date
Dmitry Sharshakov
6fbd1263cc
feat: report process MAC labels
This will be useful for debugging process access rights once we start implementing SELinux

Signed-off-by: Dmitry Sharshakov <dmitry.sharshakov@siderolabs.com>
2024-04-22 18:16:33 +03:00
Andrey Smirnov
89fc68b459
fix: service lifecycle issues
The core change is moving the context out of the `ServiceRunner` struct
to be a local variable, and using a channel to notify about shutdown
events.

Add more synchronization between Run and the moment service started to
avoid mis-identifying not running (yet) service as successfully finished.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Co-authored-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2024-03-19 18:11:13 +04:00
Dmitriy Matrenichev
32e0877607
chore: print all available logs containers in logs command completions
This is a small quality of life improvement that allows `logs` subcommand to suggest all available logs.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2024-03-11 17:48:01 +03:00
Andrey Smirnov
9d8cd4d058
chore: drop deprecated method EtcdRemoveMember
It was deprecated 16 months ago, time to cleanup.

(This is to prepare for the first v1.7 release)

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-02-01 15:54:29 +04:00
Oscar Utbult
020a0eb63e
docs: fix table formatting for bootstraprequest
Fixes formatting for https://www.talos.dev/v1.6/reference/api/#bootstraprequest

Signed-off-by: Oscar Utbult <oscar.utbult@gmail.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-11-20 20:11:43 +04:00
Oscar Utbult
de6caf5348
docs: fix table formatting for machineservice api
Fixes formatting for https://www.talos.dev/v1.6/reference/api/#machineservice

Signed-off-by: Oscar Utbult <oscar.utbult@gmail.com>
2023-11-20 17:13:10 +04:00
Andrey Smirnov
8017afb107
feat: implement CRI image management and pre-pull on K8s upgrade
Fixes #6391

Implement a set of APIs and commands to manage images in the CRI, and
pre-pull images on Kubernetes upgrades.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-07-11 19:25:10 +04:00
Utku Ozdemir
0d313b9733
feat: add reboot-mode flag to talosctl upgrade
Allow specifying the reboot mode during upgrades by introducing `--reboot-mode` flag, similar to the `--mode` flag of the reboot command.

Closes siderolabs/talos#7302.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2023-06-26 17:37:19 +02:00
Nico Berlee
0af8fe2fb5
feat: netstat pod support
talosctl netstat -k show all host and non-hostnetwork pods sockets/connections.
talosctl netstat namespace/pod shows sockets/connections of a specific pod +
autocompletes in the shell.

Signed-off-by: Nico Berlee <nico.berlee@on2it.net>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-03-30 23:39:38 +04:00
Andrey Smirnov
442cb9c1b0
feat: implement APIs to write to META
This allows to put keys to META partition.

META contents can be viewed with `talosctl get metakeys`.

There is not real usecase for it yet, but the next PRs will introduce
two special keys which can be written:

* platform network config for `metal`
* `${code}` variable

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-03-15 22:17:52 +04:00
Nico Berlee
97048f7c37
feat: netstat in API and client
Implements netstat in Talos API and client (talosctl).

Signed-off-by: Nico Berlee <nico.berlee@on2it.net>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-03-09 15:48:30 +04:00
Artem Chernyshev
b520710810
feat: introduce new flag in reset API that makes Talos reset user disks
Fixes: https://github.com/siderolabs/talos/issues/6815

Additionally, make it possible to run reset in maintenance mode: to
enable a way for resetting system disk and remove all traces of Talos
from it.

The new reset flow works in a separate sequence, changed disk probe
lookup to check the boot partition instead of the ephemeral one.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2023-02-28 15:10:41 +03:00
Andrey Smirnov
96629d5ba6
feat: implement etcd maintenance commands
This allows to safely recover out of space quota issues, and perform
degragmentation as needed.

`talosctl etcd status` command provides lots of information about the
cluster health.

See docs for more details.

Fixes #4889

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-01-03 23:25:28 +04:00
Andrey Smirnov
89dbb0ecf0
release(v1.4.0-alpha.0): prepare release
This is the official v1.4.0-alpha.0 release.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-12-23 22:32:09 +04:00
Philipp Sauter
4e114ca120
feat: use the etcd member id for etcd operations instead of hostname
We add a controller that provides the etcd member id as a resource
and change the etcd related commands to support member ids next to
hostnames.

Fixes: #6223

Signed-off-by: Philipp Sauter <philipp.sauter@siderolabs.com>
2022-11-10 19:17:56 +04:00
Andrey Smirnov
96aa9638f7
chore: rename talos-systems/talos to siderolabs/talos
There's a cyclic dependency on siderolink library which imports talos
machinery back. We will fix that after we get talos pushed under a new
name.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-11-03 16:50:32 +04:00
Utku Ozdemir
b5da686a7b
feat: add actor ID to events & emit an initial empty event
Add a new field `actorID` to the events and populate it with a UUID for the lifecycle actions `reboot`, `reset`, `upgrade` and `shutdown`. This actor ID will be present on all events emitted by this triggered action. We can use this ID later on the client side to be able to track triggered actions.

We also emit an event with an empty payload on the events streaming GRPC endpoint when a client connects. The purpose of this event is to signal to the client that the event streaming has actually started.

Server-side part of siderolabs/talos#5499.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2022-08-11 15:14:11 +02:00
Noel Georgi
b62b18a972
feat: bump k8s to v1.25.0-beta.0
Bump k8s to v1.25.0-beta.0

Update most kubernetes `master` references to `controlplane`

Signed-off-by: Noel Georgi <git@frezbo.dev>
2022-08-10 22:17:53 +05:30
Andrey Smirnov
fe2ee3b100
feat: implement MachineStatus resource
Fixes #5789

Example:

```yaml
spec:
    stage: running
    status:
        ready: false
        unmetConditions:
            - name: staticPods
              reason: kube-system/kube-controller-manager-talos-default-master-1 not ready, kube-system/kube-scheduler-talos-default-master-1 not ready
```

As events (CLI doesn't show full contents):

```
172.20.0.2   cbhf2l6f9lrs738hehfg   talos/runtime/machine.MachineStatusEvent   BOOTING   ready: false, unmet conditions: [time network services]
```

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-08-01 18:36:10 +04:00
Andrey Smirnov
065b59276c
feat: implement packet capture API
This uses the `go-packet` library with native bindings for the packet
capture (without `libpcap`). This is not the most performant way, but it
allows us to avoid CGo.

There is a problem with converting network filter expressions (like
`tcp port 3222`) into BPF instructions, it's only available in C
libraries, but there's a workaround with `tcpdump`.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-07-19 01:23:09 +04:00
Andrey Smirnov
022581d809
release(v1.2.0-alpha.0): prepare release
This is the official v1.2.0-alpha.0 release.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-06-30 19:01:07 +04:00
Artem Chernyshev
2b03057b91
feat: implement a new mode try in the config manipulation commands
The new mode allows changing the config for a period of time, which
allows trying the configuration and automatically rolling it back in case
if it doesn't work for example.

The mode can only be used with changes that can be applied without a
reboot.

When changed it doesn't write the configuration to disk, only changes it
in memory.
`--timeout` parameter can be used to customize the rollback delay.
The default timeout is 1 minute.

Any consequent configuration change will abort try mode and the last
applied configuration will be used.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2022-04-21 20:31:45 +03:00
Artem Chernyshev
2b9722d1f5
feat: add dry-run flag in apply-config and edit commands
Dry run prints out config diff, selected application mode without
changing the configuration.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2022-04-14 19:12:57 +03:00
Andrey Smirnov
25d19131d3
release(v1.1.0-alpha.0): prepare release
This is the official v1.1.0-alpha.0 release.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-04-01 18:23:19 +03:00
Andrey Smirnov
59681b8c9a
fix: backport fixes from release-1.0 branch
They were discovered as we tagged 1.0.0 version:

* wrong deprecated version
* incompatibility in extension compatibility checks

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-03-04 23:28:06 +03:00
Tim Jones
fe40e7b1b3
feat: drain node on shutdown
Cordon & drain a node when the Shutdown message is received.
Also adds a '--force' option to the shutdown command in case the control
plane is unresponsive.

Signed-off-by: Tim Jones <timniverse@gmail.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-02-01 00:06:32 +03:00
Artem Chernyshev
2f2bdb26aa
feat: replace flags with --mode in apply, edit and patch commands
Fixes: https://github.com/talos-systems/talos/issues/4588

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2022-01-13 16:09:53 +03:00
Andrey Smirnov
cb548a368a
release(v0.15.0-alpha.0): prepare release
This is the official v0.15.0-alpha.0 release.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-12-30 16:27:19 +03:00
Rohit Dandamudi
7f9922296a
feat: add powercycle mode in reboot
- Fixes #4569
- Updated reboot process sequence
- Updted api.descriptors to avoid proto type change linting error https://github.com/talos-systems/talos/pull/4612#discussion_r758599242
Signed-off-by: Rohit Dandamudi <rohit.dandamudi@siderolabs.com>

Signed-off-by: Rohit Dandamudi <rohit.dandamudi@siderolabs.com>
2021-12-02 22:40:04 +05:30
Alexey Palazhchenko
0f169bf9b1
chore: add API deprecations mechanism
Refs #4576.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@talos-systems.com>
2021-11-30 06:31:55 +00:00
Alexey Palazhchenko
20d39c0b48
chore: format .proto files
Refs #2722.

Co-authored-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@talos-systems.com>
2021-11-23 15:05:25 +00:00
Artem Chernyshev
f730252579
feat: add new event types
Add config load + validation errors and address + hostnames events.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2021-11-18 18:48:35 +03:00
Andrey Smirnov
dadaa65d54
feat: print uid/gid for the files in ls -l
This adds information about file ownership in the long listing which is
crucial sometimes.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-08-13 00:10:49 +03:00
Andrey Smirnov
eefe1c21c3
feat: add new etcd members in learner mode
Fixes #3714

This provides more safe way to join new members to the etcd cluster.

See https://etcd.io/docs/v3.4/learning/design-learner/

With learner mode join there are few differences:

* new nodes are joined one by one, because etcd enforces a single
learner member in the cluster
* learner members are not counted in quorum calculations, so while
learner catches up with the master node, quorum is not affected and
cluster is still operational

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2021-08-12 17:56:57 +03:00
Andrey Smirnov
0c7ce1cd81 feat: remove remnants of bootkube support
Fixes #3951

Bootkube support was removed in Talos 0.9. Talos versions 0.9-0.11
support conversion of self-hosted bootkube-based control plane to the
new style control plane running as static pods managed by Talos.

This commit removes all backwards compatibility and removes conversion
code.

For the k8s controllers, `BootstrapStatus` is removed and a dependency
on `etcd` service status is added (as it was implicitly there via
`BootstrapStatus`).

Remove control plane conversion code.

In k8s upgrade code, remove self-hosted part.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-08-03 07:55:42 -07:00
Alexey Palazhchenko
eea750de2c chore: rename "join" type to "worker"
Closes #3413.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-07-09 07:10:45 -07:00
Alexey Palazhchenko
bbf1c091d4 feat: add RBAC to talosctl version output
Refs #3852.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-06-28 07:10:25 -07:00
Alexey Palazhchenko
06209bba28 chore: update RBAC rules, remove old APIs
Refs #3421.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-06-18 09:54:49 -07:00
Alexey Palazhchenko
f63ab9dd9b feat: implement talosctl config new command
Refs #3421.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-06-17 09:06:43 -07:00
Artem Chernyshev
9a91142a38 feat: print complete member info in etcd members
Fixes: https://github.com/talos-systems/talos/issues/3487

Example output:

```
NODE       ID                 HOSTNAME                 PEERS                   CLIENTS
10.5.0.2   c3d3020cf75b8728   talos-default-master-1   https://10.5.0.2:2380   https://10.5.0.2:2379
```

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-04-17 11:07:59 -07:00
Andrey Smirnov
0bd8b0e800 feat: provide an option to recover etcd from data directory copy
Sometimes `talosctl etcd snapshot` might not be available, for example
when etcd is not healthy. In that case it's possible to copy raw etcd
data directory with `talosctl cp /var/lib/etcd .` and use
`member/snap/db` to recover the cluster. But such copy won't pass
integrity checks, so they should be disabled explicitly.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-04-14 08:25:32 -07:00
Alexey Palazhchenko
29da22d063 feat: add config validation warnings
Closes #3412.
Refs #3413.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-04-08 13:49:58 -07:00
Andrey Smirnov
e0650218a6 feat: support etcd recovery from snapshot on bootstrap
When Talos `controlplane` node is waiting for a bootstrap, `etcd`
contents can be recovered from a snapshot created with
`talosctl etcd snapshot` on a healthy cluster.

Bootstrap process goes same way as before, but the etcd data directory
is recovered from the snapshot.

This flow enables disaster recovery for the control plane: given that
periodic backups are available, destroy control plane nodes, re-create
them with the same config, and bootstrap one node with the saved
snapshot to recover etcd state at the time of the snapshot.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-04-08 10:15:37 -07:00
Andrey Smirnov
e664362cec feat: add API and command to save etcd snapshot (backup)
This adds a simple API and `talosctl etcd snapshot` command to stream
snapshot of etcd from one of the control plane nodes to the local file.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-04-02 09:20:16 -07:00
Artem Chernyshev
376fdcf6cb feat: implement etcd remove-member cli command
Fixes: https://github.com/talos-systems/talos/issues/3219

We already have `etcd leave`, which makes the node exclude itself from
etcd members.
But in case if the node can't remove itself because it doesn't have
connection to etcd we need this etcd remove-member cli, which basically removes
a node from a different node.

No unit tests for that as it's going to destroy the test cluster.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-03-01 07:55:08 -08:00
Andrey Smirnov
7751920dba feat: add a tool and package to convert self-hosted CP to static pods
This is required to upgrade from Talos 0.8.x to 0.9.x. After the cluster
is fully upgraded, control plane is still self-hosted (as it was
bootstrapped with bootkube).

Tool `talosctl convert-k8s` (and library behind it) performs the upgrade
to self-hosted version.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-17 23:26:57 -08:00
Andrey Smirnov
cc83b83808 feat: rename apply-config --no-reboot to --on-reboot
This explains the intetion better: config is applied on reboot, and
allows to easily distinguish it from `apply-config --immediate` which
applies config immediately without a reboot (that is coming in a
different PR).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-17 12:49:47 -08:00
Andrey Smirnov
d99a016af2 fix: correct response structure for GenerateConfig API
Also fix recovery grpc handler to print panic stacktrace to the log.

Any API should follow the structure compatible with apid proxying
injection of errors/nodes.

Explicitly fail GenerateConfig API on worker nodes, as it panics on
worker nodes (missing certificates in node config).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-11 06:34:10 -08:00
Andrey Smirnov
edf5777222 feat: add an option to force upgrade without checks
Our upgrades are safe by default - we check etcd health, take locks,
etc. But sometimes upgrades might be a way to recover broken (or
semi-broken) cluster, in that case we need upgrade to run even if the
checks are not passing. This is not a safe way to do upgrades, but it
might be a way to recover a cluster.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-09 10:20:03 -08:00
Andrey Smirnov
76a6794436 fix: kill all processes and umount all disk on reboot/shutdown
There are several ways Talos node might be restarted or shut down:

* error in sequence (initiated from machined)
* panic in main goroutine (machined recovers panics)
* error in sequence (initiated via API, event caught by machined)
* reboot/shutdown via Talos API

Before this change, paths (1) and (2) were handled in machined, and no
disks were unmounted and processes killed, so technically all the
processes are running and potentially writing to the filesystems.
Paths (3) and (4) try to stop services (but not pods) and unmount
explicitly mounted filesystems, followed by reboot directly from
sequencer (bypassing machined handler).

There was a bug that user disks were never explicitly unmounted (but
they might have been unmounted if mounted on top `/var`).

This refactors all the reboot/shutdown paths to flow through machined's
main function: on paths (4) event is sent via event API from the
sequencer back to the machined and machined initiates proper shutdown
sequence.

Refactoring in machined leads to all the paths (1)-(4) flowing through
the same function `handle(error)`.

Added two additional checks before flushing buffers:

* kill all non-system processes, this also kills all mount namespaces
* unmount any filesystem backed by `/dev/*`

This ensures all filesystems are unmounted before buffers are flushed.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-01-29 06:14:07 -08:00