talos

Author	SHA1	Message	Date
Andrey Smirnov	0c7ce1cd81	feat: remove remnants of bootkube support Fixes #3951 Bootkube support was removed in Talos 0.9. Talos versions 0.9-0.11 support conversion of self-hosted bootkube-based control plane to the new style control plane running as static pods managed by Talos. This commit removes all backwards compatibility and removes conversion code. For the k8s controllers, `BootstrapStatus` is removed and a dependency on `etcd` service status is added (as it was implicitly there via `BootstrapStatus`). Remove control plane conversion code. In k8s upgrade code, remove self-hosted part. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-08-03 07:55:42 -07:00
Alexey Palazhchenko	eea750de2c	chore: rename "join" type to "worker" Closes #3413. Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2021-07-09 07:10:45 -07:00
Alexey Palazhchenko	bbf1c091d4	feat: add RBAC to `talosctl version` output Refs #3852. Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2021-06-28 07:10:25 -07:00
Artem Chernyshev	71fff02ff0	fix: revert back resource.proto order Otherwise it breaks older `talosctl` compatibility. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2021-06-23 16:37:36 -07:00
Artem Chernyshev	1990ad2525	feat: add created and updated timestamps to the resource metadata This will allow to keep track of when the resource was created and updated. Update is tied to the version bump. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2021-06-23 13:56:49 -07:00
Andrey Smirnov	d8c2bca1b5	feat: reimplement apid certificate generation on top of COSI This PR can be split into two parts: * controllers * apid binding into COSI world Controllers ----------- * `k8s.EndpointController` provides control plane endpoints on worker nodes (it isn't required for now on control plane nodes) * `secrets.RootController` now provides OS top-level secrets (CA cert) and secret configuration * `secrets.APIController` generates API secrets (certificates) in a bit different way for workers and control plane nodes: controlplane nodes generate directly, while workers reach out to `trustd` on control plane nodes via `k8s.Endpoint` resource apid Binding ------------ Resource `secrets.API` provides binding to protobuf by converting itself back and forth to protobuf spec. apid no longer receives machine configuration, instead it receives gRPC-backed socket to access Resource API. apid watches `secrets.API` resource, fetches certs and CA from it and uses that in its TLS configuration. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-06-23 13:07:00 -07:00
Alexey Palazhchenko	06209bba28	chore: update RBAC rules, remove old APIs Refs #3421. Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2021-06-18 09:54:49 -07:00
Alexey Palazhchenko	f63ab9dd9b	feat: implement `talosctl config new` command Refs #3421. Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2021-06-17 09:06:43 -07:00
Artem Chernyshev	14e696d068	feat: update COSI runtime and add support for tail in the Talos gRPC Updated protobufs to expose tail length option. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2021-06-03 11:46:39 -07:00
Andrey Smirnov	33db8857aa	fix: use COSI runtime DestroyReady input type See https://github.com/cosi-project/runtime/pull/35 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-06-01 12:30:52 -07:00
Andrey Smirnov	c3a4173e11	chore: remove security API ReadFile/WriteFile This seems to be unused completely, and they look scary enough at the same time. For better readability and to avoid any pitfalls, better to remove them. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-04-27 03:48:20 -07:00
Artem Chernyshev	9a91142a38	feat: print complete member info in etcd members Fixes: https://github.com/talos-systems/talos/issues/3487 Example output: ``` NODE ID HOSTNAME PEERS CLIENTS 10.5.0.2 c3d3020cf75b8728 talos-default-master-1 https://10.5.0.2:2380 https://10.5.0.2:2379 ``` Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2021-04-17 11:07:59 -07:00
Andrey Smirnov	0bd8b0e800	feat: provide an option to recover etcd from data directory copy Sometimes `talosctl etcd snapshot` might not be available, for example when etcd is not healthy. In that case it's possible to copy raw etcd data directory with `talosctl cp /var/lib/etcd .` and use `member/snap/db` to recover the cluster. But such copy won't pass integrity checks, so they should be disabled explicitly. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-04-14 08:25:32 -07:00
Alexey Palazhchenko	29da22d063	feat: add config validation warnings Closes #3412. Refs #3413. Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2021-04-08 13:49:58 -07:00
Andrey Smirnov	e0650218a6	feat: support etcd recovery from snapshot on bootstrap When Talos `controlplane` node is waiting for a bootstrap, `etcd` contents can be recovered from a snapshot created with `talosctl etcd snapshot` on a healthy cluster. Bootstrap process goes same way as before, but the etcd data directory is recovered from the snapshot. This flow enables disaster recovery for the control plane: given that periodic backups are available, destroy control plane nodes, re-create them with the same config, and bootstrap one node with the saved snapshot to recover etcd state at the time of the snapshot. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-04-08 10:15:37 -07:00
Andrey Smirnov	fbfd1eb2b1	refactor: pull new version of os-runtime, update code This is mostly refactoring to adapt to the new APIs. There are some small changes which are not user-visible immediately (but visible when using `talosctl get` to inspect low-level details): * `extras` namespace is removed, it was a hack to distinguish extra and system manifests * `Manifests` are managed by two controllers as shared outputs, stored in the `controlplane` namespace now * `talosctl inspect dependencies` output got slightly changed * resources now have `md.owner` set to the controller name which manages the resource Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-04-07 06:55:09 -07:00
Andrey Smirnov	e664362cec	feat: add API and command to save etcd snapshot (backup) This adds a simple API and `talosctl etcd snapshot` command to stream snapshot of etcd from one of the control plane nodes to the local file. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-04-02 09:20:16 -07:00
Artem Chernyshev	6ffabe5169	feat: add ability to find disk by disk properties Fixes: https://github.com/talos-systems/talos/issues/3323 Not exactly matching with udevd generated `by-<id>` symlinks, but should provide sufficient amount of property selectors to be able to pick specific disks for any kind of disk: sd card, hdd, ssd, nvme. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2021-03-23 14:23:02 -07:00
Artem Chernyshev	376fdcf6cb	feat: implement etcd remove-member cli command Fixes: https://github.com/talos-systems/talos/issues/3219 We already have `etcd leave`, which makes the node exclude itself from etcd members. But in case if the node can't remove itself because it doesn't have connection to etcd we need this etcd remove-member cli, which basically removes a node from a different node. No unit tests for that as it's going to destroy the test cluster. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2021-03-01 07:55:08 -08:00
Andrey Smirnov	589d01892c	fix: update the layout of the Disks API to match proxying requirements Fixes #3199 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-24 11:33:15 -08:00
Andrey Smirnov	7751920dba	feat: add a tool and package to convert self-hosted CP to static pods This is required to upgrade from Talos 0.8.x to 0.9.x. After the cluster is fully upgraded, control plane is still self-hosted (as it was bootstrapped with bootkube). Tool `talosctl convert-k8s` (and library behind it) performs the upgrade to self-hosted version. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-17 23:26:57 -08:00
Andrey Smirnov	e5bd35ae3c	feat: add resource watch API + CLI This uses API in `os-runtime` to pull the initial list of resources + updates for resource by type. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-17 13:24:47 -08:00
Andrey Smirnov	cc83b83808	feat: rename apply-config --no-reboot to --on-reboot This explains the intention better: config is applied on reboot, and allows to easily distinguish it from `apply-config --immediate` which applies config immediately without a reboot (that is coming in a different PR). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-17 12:49:47 -08:00
Andrey Smirnov	d99a016af2	fix: correct response structure for GenerateConfig API Also fix recovery grpc handler to print panic stacktrace to the log. Any API should follow the structure compatible with apid proxying injection of errors/nodes. Explicitly fail GenerateConfig API on worker nodes, as it panics on worker nodes (missing certificates in node config). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-11 06:34:10 -08:00
Andrey Smirnov	edf5777222	feat: add an option to force upgrade without checks Our upgrades are safe by default - we check etcd health, take locks, etc. But sometimes upgrades might be a way to recover broken (or semi-broken) cluster, in that case we need upgrade to run even if the checks are not passing. This is not a safe way to do upgrades, but it might be a way to recover a cluster. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-02-09 10:20:03 -08:00
Andrey Smirnov	76a6794436	fix: kill all processes and umount all disk on reboot/shutdown There are several ways Talos node might be restarted or shut down: * error in sequence (initiated from machined) * panic in main goroutine (machined recovers panics) * error in sequence (initiated via API, event caught by machined) * reboot/shutdown via Talos API Before this change, paths (1) and (2) were handled in machined, and no disks were unmounted and processes killed, so technically all the processes are running and potentially writing to the filesystems. Paths (3) and (4) try to stop services (but not pods) and unmount explicitly mounted filesystems, followed by reboot directly from sequencer (bypassing machined handler). There was a bug that user disks were never explicitly unmounted (but they might have been unmounted if mounted on top `/var`). This refactors all the reboot/shutdown paths to flow through machined's main function: on paths (4) event is sent via event API from the sequencer back to the machined and machined initiates proper shutdown sequence. Refactoring in machined leads to all the paths (1)-(4) flowing through the same function `handle(error)`. Added two additional checks before flushing buffers: * kill all non-system processes, this also kills all mount namespaces * unmount any filesystem backed by `/dev/*` This ensures all filesystems are unmounted before buffers are flushed. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-01-29 06:14:07 -08:00
Andrey Smirnov	0aaf8fa968	feat: replace bootkube with Talos-managed control plane Control plane components are running as static pods managed by the kubelets. Whole subsystem is managed via resources/controllers from os-runtime. Many supporting changes/refactoring to enable new code paths. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-01-26 14:22:35 -08:00
Andrey Smirnov	11863dd74d	feat: implement resource API in Talos This brings in `os-runtime` package and exposes resources with first iteration of read-only API. Two Talos resources (and one controller) are implemented: * legacy.Service resource tracks Talos 'service' `RUNNING` state * config.V1Alpha1 stores current runtime config Glue point between existing runtime and new os-runtime based runtime is in `v1alpha2` implementation and `V1Alpha2()` sub-interfaces of existing `Runtime`, `State`, `Controller` interfaces. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2021-01-19 11:45:46 -08:00
Alexey Palazhchenko	f3465b8e3e	feat: support type filter in list API and CLI Closes #2068. Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>	2020-12-24 06:34:02 -08:00
Andrey Smirnov	6a0e652f0c	fix: correctly transport gRPC errors from apid Before these changes, errors were always sent as strings, so if original error was gRPC error (which is almost always the case for apid), it is formatted as string and original fields (like code) are lost in the formatted string. With this change, apid sends errors as official `grpc.Status` protobuf structure, and client decodes that into Go grpc.Status based error. This change is backwards and forwards compatible. This should fix more cases when integration tests were not able to ignore grpc `transport is closing` errors when they were sent as strings from the apid endpoint. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-12-23 11:08:51 -08:00
Artem Chernyshev	a83e8758db	feat: add commands to manage/query etcd cluster Used already existing protobufs for that. Commands: `talosctl etcd members -n <node>` `talosctl etcd leave -n <node>` `talosctl etcd forfeit-leadership -n <node>` Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-12-22 11:49:10 -08:00
Andrey Smirnov	54ed80e244	feat: reset with system disk wipe spec Idea is to add an option to perform "selective" reset: default reset operation is to wipe all partitions (triggering reinstall), while spec allows only to wipe some of the partitions. Other operations are performed exactly in the same way for any reset flow. Possible use case: reset only `EPHEMERAL` partition. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-12-10 11:31:07 -08:00
Andrey Smirnov	350280eb59	feat: implement "staged" (failsafe/backup) upgrades Regular upgrade path takes just one reboot, but it requires all the processes to be stopped on the node before upgrade might proceed. Under some circumstances and with potential Talos bugs it might not work rendering Talos upgrades almost impossible. Staged upgrades build upon regular install flow to run the upgrade on the node reboot. Such upgrades require two reboots of the node, and it requires two pulls of the installer image, but they should be much less susceptible to failure. Once the upgrade is staged, node can be rebooted in any possible way, including hard reset and upgrade is performed on the next boot. New ADV format was implemented as well to allow to store install image ref/options across reboots. New format allows for bigger values and takes 50% of the `META` partition. Old ADV is still kept for compatibility reasons. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-12-08 08:34:26 -08:00
Artem Chernyshev	5d48bd5f6a	feat: allow disabling NoSchedule taint on masters using TUI installer I think this should come handy for setting up single node SBC clusters. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-12-07 07:31:54 -08:00
Artem Chernyshev	63e0d02aa9	feat: add TUI for configuring network interfaces settings Allows configuring: - cidr. - dhcp enable/disable. - MTU. - Ignore. - Dhcp metric. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-12-03 11:05:55 -08:00
Artem Chernyshev	c7062e3f4d	feat: make GenerateConfiguration accept current time as a parameter If the node time is out of sync, it can generate incorrect configuration. And maintenance mode does not allow us starting ntp, because there is no containerd. By providing current UTC time of the machine where talosctl client is running, it is possible to force GenerateConfiguration use correct time. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-12-03 08:28:11 -08:00
Artem Chernyshev	f96cffd2b2	feat: add ability to choose CNI config Initial version which only allows setting CNI using preset, no custom CNI urls are supported at the moment. Still need to figure out what kind of UI can be used for that. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-11-26 06:49:54 -08:00
Andrey Smirnov	9a32e34cb1	feat: implement apply configuration without reboot This allows config to be written to disk without being applied immediately. Small refactoring to extract common code paths. At first, I tried to implement this via the sequencer, but looks like it's too hard to get it right, as sequencer lacks context and config to be written is not applied to the runtime. Fixes #2828 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-11-23 12:42:44 -08:00
Artem Chernyshev	8513123d22	feat: return client config as the second value in GenerateConfiguration To be used in interactive installer to output the node client configuration to a file. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-11-17 07:20:05 -08:00
Artem Chernyshev	0f924b5122	feat: add generate config gRPC API Fixes: https://github.com/talos-systems/talos/issues/2766 This API is implemented in Maintenance and Machine services. Can be used to generate configuration on the node, instead of using talosctl to generate it locally. To be used in interactive installer and talosctl gen config. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-11-13 08:07:32 -08:00
Artem Chernyshev	93e30a1738	chore: remove maintenance service interface and use machine service Now maintenance service implements `MachineService` interface, stubbing all not implemented methods. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-11-11 12:33:44 -08:00
Andrew Rynhard	71321214a1	feat: add storage API This is the initial implementation of a storage API. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-11-11 10:12:25 -08:00
Andrey Smirnov	026244097a	refactor: drop osd compatibility layer Fixes #2761 Service `osd` was merged into machined on Jul, 13th, before 0.6 release. It's time to drop the backwards compatibility with clients before 0.6. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-11-11 09:38:19 -08:00
Andrew Rynhard	562f816526	refactor: use gRPC for interactive installation Instead of hosting a web service, we decided to implement a gRPC service that exposes APIs that can be used in a client-side interactive installer. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-11-03 08:36:44 -08:00
Artem Chernyshev	e7e99cf1b3	feat: support disk usage command in talosctl Usage example: ```bash talosctl du --nodes 10.5.0.2 /var -H -d 2 NODE NAME 10.5.0.2 8.4 kB etc 10.5.0.2 1.3 GB lib 10.5.0.2 16 MB log 10.5.0.2 25 kB run 10.5.0.2 4.1 kB tmp 10.5.0.2 1.3 GB . ``` Supported flags: - `-a` writes counts for all files, not just directories. - `-d` recursion depth - '-H' humanize size outputs. - '-t' size threshold (skip files if < size or > size). Fixes: https://github.com/talos-systems/talos/issues/2504 Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-10-13 09:30:31 -07:00
Andrew Rynhard	4eeef28e90	feat: add etcd API This adds RPCs for basic etcd management tasks. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-10-06 11:30:04 -07:00
Seán C McCord	ff92d2a14b	feat: add ApplyConfiguration API Adds the ability to apply (replace) an existing node configuration with a new one via the Machine API. Fixes #2345 Signed-off-by: Seán C McCord <ulexus@gmail.com>	2020-09-29 14:44:06 -07:00
Andrey Smirnov	bddd4f1bf6	refactor: move external API packages into `machinery/` This moves `pkg/config`, `pkg/client` and `pkg/constants` under `pkg/machinery` umbrella. And `pkg/machinery` is published as Go module inside Talos repository. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-08-17 09:56:14 -07:00
Andrey Smirnov	74413b1393	fix: ignore sequence lock errors in machined This prevents reboots when some actions triggers sequence while another sequence is still running. Fixes #2209 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-20 14:36:06 -07:00
Andrey Smirnov	ad99cb6421	feat: implement talosctl dashboard command This builds a simple CLI UI for Talos cluster monitoring. Some new APIs were added for monitoring based on Prometheus procfs package. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-16 14:24:04 -07:00

190 Commits