190 Commits

Author SHA1 Message Date
Andrey Smirnov
0c7ce1cd81 feat: remove remnants of bootkube support
Fixes #3951

Bootkube support was removed in Talos 0.9. Talos versions 0.9-0.11
support conversion of self-hosted bootkube-based control plane to the
new style control plane running as static pods managed by Talos.

This commit removes all backwards compatibility and removes conversion
code.

For the k8s controllers, `BootstrapStatus` is removed and a dependency
on `etcd` service status is added (as it was implicitly there via
`BootstrapStatus`).

Remove control plane conversion code.

In k8s upgrade code, remove self-hosted part.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-08-03 07:55:42 -07:00
Alexey Palazhchenko
eea750de2c chore: rename "join" type to "worker"
Closes #3413.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-07-09 07:10:45 -07:00
Alexey Palazhchenko
bbf1c091d4 feat: add RBAC to talosctl version output
Refs #3852.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-06-28 07:10:25 -07:00
Artem Chernyshev
71fff02ff0 fix: revert back resource.proto order
Otherwise it breaks older `talosctl` compatibility.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-06-23 16:37:36 -07:00
Artem Chernyshev
1990ad2525 feat: add created and updated timestamps to the resource metadata
This will allow to keep track of when the resource was created and
updated.
Update is tied to the version bump.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-06-23 13:56:49 -07:00
Andrey Smirnov
d8c2bca1b5 feat: reimplement apid certificate generation on top of COSI
This PR can be split into two parts:

* controllers
* apid binding into COSI world

Controllers
-----------

* `k8s.EndpointController` provides control plane endpoints on worker
nodes (it isn't required for now on control plane nodes)
* `secrets.RootController` now provides OS top-level secrets (CA cert)
and secret configuration
* `secrets.APIController` generates API secrets (certificates) in a bit
different way for workers and control plane nodes: controlplane nodes
generate directly, while workers reach out to `trustd` on control plane
nodes via `k8s.Endpoint` resource

apid Binding
------------

Resource `secrets.API` provides binding to protobuf by converting
itself back and forth to protobuf spec.

apid no longer receives machine configuration, instead it receives
gRPC-backed socket to access Resource API. apid watches `secrets.API`
resource, fetches certs and CA from it and uses that in its TLS
configuration.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-23 13:07:00 -07:00
Alexey Palazhchenko
06209bba28 chore: update RBAC rules, remove old APIs
Refs #3421.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-06-18 09:54:49 -07:00
Alexey Palazhchenko
f63ab9dd9b feat: implement talosctl config new command
Refs #3421.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-06-17 09:06:43 -07:00
Artem Chernyshev
14e696d068 feat: update COSI runtime and add support for tail in the Talos gRPC
Updated protobufs to expose tail length option.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-06-03 11:46:39 -07:00
Andrey Smirnov
33db8857aa fix: use COSI runtime DestroyReady input type
See https://github.com/cosi-project/runtime/pull/35

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-06-01 12:30:52 -07:00
Andrey Smirnov
c3a4173e11 chore: remove security API ReadFile/WriteFile
This seems to be unused completely, and they look scary enough at the
same time.

For better readability and to avoid any pitfalls, better to remove them.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-04-27 03:48:20 -07:00
Artem Chernyshev
9a91142a38 feat: print complete member info in etcd members
Fixes: https://github.com/talos-systems/talos/issues/3487

Example output:

```
NODE       ID                 HOSTNAME                 PEERS                   CLIENTS
10.5.0.2   c3d3020cf75b8728   talos-default-master-1   https://10.5.0.2:2380   https://10.5.0.2:2379
```

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-04-17 11:07:59 -07:00
Andrey Smirnov
0bd8b0e800 feat: provide an option to recover etcd from data directory copy
Sometimes `talosctl etcd snapshot` might not be available, for example
when etcd is not healthy. In that case it's possible to copy raw etcd
data directory with `talosctl cp /var/lib/etcd .` and use
`member/snap/db` to recover the cluster. But such copy won't pass
integrity checks, so they should be disabled explicitly.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-04-14 08:25:32 -07:00
Alexey Palazhchenko
29da22d063 feat: add config validation warnings
Closes #3412.
Refs #3413.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-04-08 13:49:58 -07:00
Andrey Smirnov
e0650218a6 feat: support etcd recovery from snapshot on bootstrap
When Talos `controlplane` node is waiting for a bootstrap, `etcd`
contents can be recovered from a snapshot created with
`talosctl etcd snapshot` on a healthy cluster.

Bootstrap process goes same way as before, but the etcd data directory
is recovered from the snapshot.

This flow enables disaster recovery for the control plane: given that
periodic backups are available, destroy control plane nodes, re-create
them with the same config, and bootstrap one node with the saved
snapshot to recover etcd state at the time of the snapshot.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-04-08 10:15:37 -07:00
Andrey Smirnov
fbfd1eb2b1 refactor: pull new version of os-runtime, update code
This is mostly refactoring to adapt to the new APIs.

There are some small changes which are not user-visible immediately (but
visible when using `talosctl get` to inspect low-level details):

* `extras` namespace is removed, it was a hack to distinguish extra and
system manifests
* `Manifests` are managed by two controllers as shared outputs, stored
in the `controlplane` namespace now
* `talosctl inspect dependencies` output got slightly changed
* resources now have `md.owner` set to the controller name which manages
the resource

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-04-07 06:55:09 -07:00
Andrey Smirnov
e664362cec feat: add API and command to save etcd snapshot (backup)
This adds a simple API and `talosctl etcd snapshot` command to stream
snapshot of etcd from one of the control plane nodes to the local file.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-04-02 09:20:16 -07:00
Artem Chernyshev
6ffabe5169 feat: add ability to find disk by disk properties
Fixes: https://github.com/talos-systems/talos/issues/3323

Not exactly matching with udevd generated `by-<id>` symlinks, but should
provide sufficient amount of property selectors to be able to pick
specific disks for any kind of disk: sd card, hdd, ssd, nvme.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-03-23 14:23:02 -07:00
Artem Chernyshev
376fdcf6cb feat: implement etcd remove-member cli command
Fixes: https://github.com/talos-systems/talos/issues/3219

We already have `etcd leave`, which makes the node exclude itself from
etcd members.
But in case if the node can't remove itself because it doesn't have
connection to etcd we need this etcd remove-member cli, which basically removes
a node from a different node.

No unit tests for that as it's going to destroy the test cluster.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-03-01 07:55:08 -08:00
Andrey Smirnov
589d01892c fix: update the layout of the Disks API to match proxying requirements
Fixes #3199

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-24 11:33:15 -08:00
Andrey Smirnov
7751920dba feat: add a tool and package to convert self-hosted CP to static pods
This is required to upgrade from Talos 0.8.x to 0.9.x. After the cluster
is fully upgraded, control plane is still self-hosted (as it was
bootstrapped with bootkube).

Tool `talosctl convert-k8s` (and library behind it) performs the upgrade
to self-hosted version.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-17 23:26:57 -08:00
Andrey Smirnov
e5bd35ae3c feat: add resource watch API + CLI
This uses API in `os-runtime` to pull the initial list of resources +
updates for resource by type.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-17 13:24:47 -08:00
Andrey Smirnov
cc83b83808 feat: rename apply-config --no-reboot to --on-reboot
This explains the intetion better: config is applied on reboot, and
allows to easily distinguish it from `apply-config --immediate` which
applies config immediately without a reboot (that is coming in a
different PR).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-17 12:49:47 -08:00
Andrey Smirnov
d99a016af2 fix: correct response structure for GenerateConfig API
Also fix recovery grpc handler to print panic stacktrace to the log.

Any API should follow the structure compatible with apid proxying
injection of errors/nodes.

Explicitly fail GenerateConfig API on worker nodes, as it panics on
worker nodes (missing certificates in node config).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-11 06:34:10 -08:00
Andrey Smirnov
edf5777222 feat: add an option to force upgrade without checks
Our upgrades are safe by default - we check etcd health, take locks,
etc. But sometimes upgrades might be a way to recover broken (or
semi-broken) cluster, in that case we need upgrade to run even if the
checks are not passing. This is not a safe way to do upgrades, but it
might be a way to recover a cluster.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-02-09 10:20:03 -08:00
Andrey Smirnov
76a6794436 fix: kill all processes and umount all disk on reboot/shutdown
There are several ways Talos node might be restarted or shut down:

* error in sequence (initiated from machined)
* panic in main goroutine (machined recovers panics)
* error in sequence (initiated via API, event caught by machined)
* reboot/shutdown via Talos API

Before this change, paths (1) and (2) were handled in machined, and no
disks were unmounted and processes killed, so technically all the
processes are running and potentially writing to the filesystems.
Paths (3) and (4) try to stop services (but not pods) and unmount
explicitly mounted filesystems, followed by reboot directly from
sequencer (bypassing machined handler).

There was a bug that user disks were never explicitly unmounted (but
they might have been unmounted if mounted on top `/var`).

This refactors all the reboot/shutdown paths to flow through machined's
main function: on paths (4) event is sent via event API from the
sequencer back to the machined and machined initiates proper shutdown
sequence.

Refactoring in machined leads to all the paths (1)-(4) flowing through
the same function `handle(error)`.

Added two additional checks before flushing buffers:

* kill all non-system processes, this also kills all mount namespaces
* unmount any filesystem backed by `/dev/*`

This ensures all filesystems are unmounted before buffers are flushed.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-01-29 06:14:07 -08:00
Andrey Smirnov
0aaf8fa968 feat: replace bootkube with Talos-managed control plane
Control plane components are running as static pods managed by the
kubelets.

Whole subsystem is managed via resources/controllers from os-runtime.

Many supporting changes/refactoring to enable new code paths.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-01-26 14:22:35 -08:00
Andrey Smirnov
11863dd74d feat: implement resource API in Talos
This brings in `os-runtime` package and exposes resources with first
iteration of read-only API.

Two Talos resources (and one controller) are implemented:

* legacy.Service resource tracks Talos 'service' `RUNNING` state
* config.V1Alpha1 stores current runtime config

Glue point between existing runtime and new os-runtime based runtime is
in `v1alpha2` implementation and `V1Alpha2()` sub-interfaces of existing
`Runtime`, `State`, `Controller` interfaces.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-01-19 11:45:46 -08:00
Alexey Palazhchenko
f3465b8e3e feat: support type filter in list API and CLI
Closes #2068.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2020-12-24 06:34:02 -08:00
Andrey Smirnov
6a0e652f0c fix: correctly transport gRPC errors from apid
Before these changes, errors were always sent as strings, so if original
error was gRPC error (which is almost always the case for apid), it is
formatted as string and original fields (like code) are lost in the
formatted string.

With this change, apid sends errors as official `grpc.Status` protobuf
structure, and client decodes that into Go grpc.Status based error.

This change is backwards and forwards compatible.

This should fix more cases when integration tests were not able to
ignore grpc `transport is closing` errors when they were sent as strings
from the apid endpoint.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-23 11:08:51 -08:00
Artem Chernyshev
a83e8758db feat: add commands to manage/query etcd cluster
Used already existing protobufs for that.

Commands:
`talosctl etcd members -n <node>`
`talosctl etcd leave -n <node>`
`talosctl etcd forfeit-leadership -n <node>`

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-12-22 11:49:10 -08:00
Andrey Smirnov
54ed80e244 feat: reset with system disk wipe spec
Idea is to add an option to perform "selective" reset: default reset
operation is to wipe all partitions (triggering reinstall), while spec
allows only to wipe some of the operations.

Other operations are performed exactly in the same way for any reset
flow.

Possible use case: reset only `EPHEMERAL` partition.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-10 11:31:07 -08:00
Andrey Smirnov
350280eb59 feat: implement "staged" (failsafe/backup) upgrades
Regular upgrade path takes just one reboot, but it requires all the
processes to be stopped on the node before upgrade might proceed. Under
some circumstances and with potential Talos bugs it might not work
rendering Talos upgrades almost impossible.

Staged upgrades build upon regular install flow to run the upgrade on
the node reboot. Such upgrades require two reboots of the node, and it
requires two pulls of the installer image, but they should be much less
suspicious to the failure. Once the upgrade is staged, node can be
rebooted in any possible way, including hard reset and upgrade is
performed on the next boot.

New ADV format was implemented as well to allow to store install image
ref/options across reboots. New format allows for bigger values and
takes 50% of the `META` partition. Old ADV is still kept for
compatibility reasons.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-12-08 08:34:26 -08:00
Artem Chernyshev
5d48bd5f6a feat: allow disabling NoSchedule taint on masters using TUI installer
I think this should come handy for setting up single node SBC clusters.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-12-07 07:31:54 -08:00
Artem Chernyshev
63e0d02aa9 feat: add TUI for configuring network interfaces settings
Allows configuring:
- cidr.
- dhcp enable/disable.
- MTU.
- Ignore.
- Dhcp metric.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-12-03 11:05:55 -08:00
Artem Chernyshev
c7062e3f4d feat: make GenerateConfiguration accept current time as a parameter
If the node time is out of sync, it can generate incorrect
configuration. And maintenance mode does not allow us starting ntp,
because there is no containerd.

By providing current UTC time of the machine where talosctl client is
running, it is possible to force GenerateConfiguration use correct time.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-12-03 08:28:11 -08:00
Artem Chernyshev
f96cffd2b2 feat: add ability to choose CNI config
Initial version which only allows setting CNI using preset, no custom
CNI urls are supported at the moment. Still need to figure out what kind
of UI can be used for that.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-11-26 06:49:54 -08:00
Andrey Smirnov
9a32e34cb1 feat: implement apply configuration without reboot
This allows config to be written to disk without being applied
immediately.

Small refactoring to extract common code paths.

At first, I tried to implement this via the sequencer, but looks like
it's too hard to get it right, as sequencer lacks context and config to
be written is not applied to the runtime.

Fixes #2828

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-11-23 12:42:44 -08:00
Artem Chernyshev
8513123d22 feat: return client config as the second value in GenerateConfiguration
To be used in interactive installer to output the node client
configuration to a file.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-11-17 07:20:05 -08:00
Artem Chernyshev
0f924b5122 feat: add generate config gRPC API
Fixes: https://github.com/talos-systems/talos/issues/2766

This API is implemented in Maintenance and Machine services.
Can be used to generate configuration on the node, instead of using
talosctl to generate it locally.

To be used in interactive installer and talosctl gen config.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-11-13 08:07:32 -08:00
Artem Chernyshev
93e30a1738 chore: remove maintenance service interface and use machine service
Now maintenance service implements `MachineService` interface, stubbing
all not implemented methods.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-11-11 12:33:44 -08:00
Andrew Rynhard
71321214a1 feat: add storage API
This is the initial implementation of a storage API.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-11-11 10:12:25 -08:00
Andrey Smirnov
026244097a refactor: drop osd compatibility layer
Fixes #2761

Service `osd` was merged into machined on Jul, 13th, before 0.6 release.

It's time to drop the backwards compatibility with clients before 0.6.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-11-11 09:38:19 -08:00
Andrew Rynhard
562f816526 refactor: use gRPC for interactive installation
Instead of hosting a web service, we decided to implement a gRPC service
that exposes APIs that can be used in a client-side interactive installer.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-11-03 08:36:44 -08:00
Artem Chernyshev
e7e99cf1b3 feat: support disk usage command in talosctl
Usage example:

```bash
talosctl du --nodes 10.5.0.2 /var -H -d 2
NODE       NAME
10.5.0.2   8.4 kB   etc
10.5.0.2   1.3 GB   lib
10.5.0.2   16 MB    log
10.5.0.2   25 kB    run
10.5.0.2   4.1 kB   tmp
10.5.0.2   1.3 GB   .
```

Supported flags:
- `-a` writes counts for all files, not just directories.
- `-d` recursion depth
- '-H' humanize size outputs.
- '-t' size threshold (skip files if < size or > size).

Fixes: https://github.com/talos-systems/talos/issues/2504

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-10-13 09:30:31 -07:00
Andrew Rynhard
4eeef28e90 feat: add etcd API
This adds RPCs for basic etcd management tasks.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-10-06 11:30:04 -07:00
Seán C McCord
ff92d2a14b feat: add ApplyConfiguration API
Adds the ability to apply (replace) an existing node configuration with
a new one via the Machine API.

Fixes #2345

Signed-off-by: Seán C McCord <ulexus@gmail.com>
2020-09-29 14:44:06 -07:00
Andrey Smirnov
bddd4f1bf6 refactor: move external API packages into machinery/
This moves `pkg/config`, `pkg/client` and `pkg/constants`
under `pkg/machinery` umbrella.

And `pkg/machinery` is published as Go module inside Talos repository.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-08-17 09:56:14 -07:00
Andrey Smirnov
74413b1393 fix: ignore sequence lock errors in machined
This prevents reboots when some actions triggers sequence while another
sequence is still running.

Fixes #2209

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-20 14:36:06 -07:00
Andrey Smirnov
ad99cb6421 feat: implement talosctl dashboard command
This builds a simple CLI UI for Talos cluster monitoring.

Some new APIs were added for monitoring based on Prometheus procfs
package.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-16 14:24:04 -07:00