2137 Commits

Author SHA1 Message Date
Andrey Smirnov
574d48e540
fix: use image digest when starting a container
First of all, it seems to be "right way", as it makes sure the image is
looked up by the digest.

Second, it fixes the case when image is specified with both tag and
digest (which is not supposed to be the correct ref, but it is used
frequently).

Talos since 1.5.0 stores images with the following aliases:

```
gcr.io/etcd-development/etcd:v3.5.9
gcr.io/etcd-development/etcd@sha256:8c956d9b0d39745fa574bb4dbacd362ffdc1109479432f54094859d4cf984b17
ghcr.io/siderolabs/kubelet:v1.28.0
ghcr.io/siderolabs/kubelet@sha256:50710f2cd3328c23f57dfc7fb00940d8cfd402315e33fc7cb8184fc660650a5c
sha256:50710f2cd3328c23f57dfc7fb00940d8cfd402315e33fc7cb8184fc660650a5c
sha256:8c956d9b0d39745fa574bb4dbacd362ffdc1109479432f54094859d4cf984b17
```

This change pulls the digest format (the last in this list) and uses it
to start a container.

Fixes #7640

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-08-21 15:48:59 +04:00
Noel Georgi
6b0373ebef
chore: move bash tests to integration
move extensions and secureboot tests to integration.
Makes it easier to test.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-08-17 19:58:35 +05:30
Andrey Smirnov
86c94eff8d
refactor: docgen and config examples
Short version is: move from global variables/`init()` function into
explicit functions.

`docgen` was updated to skip creating any top-level global variables,
now `Doc` information is generated on the fly when it is accessed.
Talos itself doesn't marshal the configuration often, so in general it
should never be accessed for Talos (but will be accessed e.g. for
`talosctl`).

Machine config examples were changed manually from variables to
functions returning a value and moved to a separate file.

There are no changes to the output of `talosctl gen config`.

There is a small change to the generated documentation, which I believe
is a correct one, as previously due to value reuse it was clobbered with
other data.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-08-10 14:56:01 +04:00
Andrey Smirnov
ee6d639f6c
fix: match routes on the priority properly
Fixes #7592

The problem was a mismatch between a "primary key" (ID) of the
`RouteSpec` and the way routes are looked up in the kernel - with two
idential routes but different priority Talos would end up in an infinite
loop fighting to remove and re-add back same route, as priority never
matches.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-08-10 14:29:47 +04:00
Dmitriy Matrenichev
c4a1ca8d61
chore: remove <-errCh where possible in grpc methods
Simplify code by passing error directly into the pipe closer.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-08-07 22:28:58 +03:00
Andrey Smirnov
e0f383598e
chore: clean up the output of the imager
Use `Progress`, and options to pass around the way messages are written.

Fixed some tiny issues in the code, but otherwise no functional changes.

To make colored output work with `docker run`, switched back image
generation to use volume mount for output (old mode is still
functioning, but it's not the default, and it works when docker is not
running on the same host).

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-08-07 16:00:14 +04:00
Andrey Smirnov
fb536af4d1
chore: optimize memory usage of tcell library on init
There are two changes here:

* build `machined` binary with `tcell_minimal` tag (which disables
  loading some parts of the terminfo database), which also affects
  `apid`, `trustd` and `dashboard` processes, as they run from the same
  executable; in `dashboard` explicitly import `linux` terminal we're
  using when the `dashboard` runs on the machine
* pass `TCELL_MINIMIZE=1` environment variable to each Talos process
  which removes 0.5MiB of runewdith allocation for a lookup table

See #7578

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-08-04 17:59:18 +04:00
Artem Chernyshev
7d688ccfeb
fix: make encryption config provider default to luks2 if not set
Fixes: https://github.com/siderolabs/talos/issues/7515

Rename `Kind` to `Provider` in the `v1alpha1_provider`.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2023-08-04 12:20:55 +03:00
Dmitriy Matrenichev
80238a05a6
chore: unify semver under github.com/blang/semver/v4
Currently, we use `github.com/coreos/go-semver/semver` and `github.com/hashicorp/go-version`
for version parsing. As we use `github.com/blang/semver/v4` in our other projects, and it
has more features, it makes sense to use it across the projects. It also doesn't allocate
like crazy in `KubernetesVersion.SupportedWith`.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-08-04 00:29:52 +03:00
Andrey Smirnov
0f1920bdda
chore: provide a resource to peek into Linux clock adjustments
This is a follow-up for #7567, which won't be backported to 1.5.

This allows to get an output like:

```
$ talosctl -n 172.20.0.5 get adjtimestatus -w
NODE         *   NAMESPACE TYPE            ID     VERSION   OFFSET        ESTERROR   MAXERROR   STATUS               SYNC
172.20.0.5   +   runtime   AdjtimeStatus   node   47        -18.14306ms   0s         191.5ms    STA_PLL | STA_NANO   true
172.20.0.5       runtime   AdjtimeStatus   node   48        -17.109555ms  0s         206.5ms    STA_NANO | STA_PLL   true
172.20.0.5       runtime   AdjtimeStatus   node   49        -16.134923ms  0s         221.5ms    STA_NANO | STA_PLL   true
172.20.0.5       runtime   AdjtimeStatus   node   50        -15.21581ms   0s         236.5ms    STA_PLL | STA_NANO   true
```

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-08-03 22:06:53 +04:00
Andrey Smirnov
4eab3017b0
fix: calculate log2i properly
Fixes #7080

The real bug was off-by-one in `log2i` implementation, other changes are
cleanups as `x/sys/unix` package now contains all the constants we need.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-08-03 21:17:58 +04:00
Jared Davenport
bcf2845307
fix: update providerid prefix for aws
This PR updates the ProviderID format for aws resources. There seems to
be a bug when using Talos CCM (which consumes this value from Talos)
because the format is `aws://x/y` (two slashes) vs. the expected
`aws:///x/y` (three slashes) that is set with the AWS CCM code
[here](d055109367/pkg/providers/v1/instances.go (L47-L53)).

Setting only two slashes causes important software in the workload
cluster to fail, specifically cluster-autoscaler. The regex they use for
pulling providerID is [here](702e9685d6/cluster-autoscaler/cloudprovider/aws/aws_cloud_provider.go (L195)).

Signed-off-by: Spencer Smith <spencer.smith@talos-systems.com>
2023-08-03 10:21:56 -04:00
Andrey Smirnov
793dcedc95
fix: fast-wipe the system disk on talosctl reset
Fixes #7558

I see no reason to keep old behavior (removing all partitions on the
disk), as it's only compatible with Talos itself.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-08-03 16:28:59 +04:00
Andrey Smirnov
87fe8f1a2a
feat: implement image generation profiles
Support full configuration for image generation, including image
outputs, support most features (where applicable) for all image output
types, unify image generation process.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-08-02 19:13:44 +04:00
Andrei Kvapil
10f958cf41
feat: network configuration improvements on the NoCloud platform
* support for bonding
* added interface selection by MAC address
* added routes management

Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-08-02 15:03:33 +04:00
Dmitriy Matrenichev
abf3831174
chore: remove cpu_manager_state on cpuManagerPolicy change
After we closed `kubelet`, remove `/var/lib/kubelet/cpu_manager_state` if there are any changes in `cpuManagerPolicy`.
We do not add any other safeguards, so it's user responsibility to cordon/drain the node in advance.

Also minor fixes in other files.

Closes #7504

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-08-01 18:53:04 +03:00
Noel Georgi
68e6b98f7d
feat: add security state resource
Add security state resource that describes the state of Talos SecureBoot
and PCR signing key fingerprints.

The UKI fingerprint is currently not populated.

Fixes: #7514

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-07-31 22:02:08 +05:30
Andrey Smirnov
a17272cdda
chore: update hcloud API SDK to v2
There are no functional changes, but SDK got updated to handle int ->
int64 changes. v1 version is only supported to Sep 2023.

See https://github.com/hetznercloud/hcloud-go#support

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-07-28 19:00:10 +04:00
Andrey Smirnov
6d71bb8df2
refactor: replace google/gopacket with gopacket/gopacket
This new fork seems to be more active. The change itself doesn't fix any
memory allocation, but I submitted a PR for gopacket/gopacket:

https://github.com/gopacket/gopacket/pull/24

Also fix crazy alloc in `tui/components` (this is only relevant for
`talosctl`).

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-07-28 17:34:15 +04:00
Andrey Smirnov
846f37d84c
refactor: drop dependency on vmware/govmomi
This module was imported just for a single Go struct (for XML
unmarshalling), and it could be easily internalized.

The module causes significant allocation on startup:

```
init github.com/vmware/govmomi/vim25/types @23 ms, 1.4 ms clock, 1269864 bytes, 196 allocs
```

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-07-28 16:49:34 +04:00
Andrey Smirnov
ca0b32c514
refactor: update AWS SDK and http-getter to v2 versions
Both are much modular and pull in much less dependendencies in to the
Talos tree.

This solves the problem with allocations in AWS endpoints on import, and
removes a bunch of dependencies.

Raw binary size: -10 MiB.

Memory usage (not scientific): around -5 MiB for all Talos services.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-07-28 15:30:02 +04:00
Noel Georgi
a3a2aa8ef3
fix: use fast wipe for upgrade
As part of bootloader refactoring `go-blockdevice` was used for wiping
partitions in #7329, but used standard wipe which could be fast/slow
depending on the blockdevice support. Switch to using fast-wipe for
partitions. This should not affect `wipe` option in machineconfig.

Fixes: #7531

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-07-27 21:09:06 +05:30
Utku Ozdemir
355681ddab
fix: terminate dashboard gracefully on & switch back to tty1
- Make dashboard SIGTERM-aware
- Handle panics on dashboard and terminate it gracefully, so it resets the terminal properly
- Switch to TTY2 when it starts and back to TTY1 when it stops.

Closes siderolabs/talos#7516.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2023-07-27 16:00:23 +02:00
Andrey Smirnov
544cb4fe7d
refactor: accept partial machine configuration
This refactors code to handle partial machine config - only multi-doc
without v1alpha1 config.

This uses improvements from
https://github.com/cosi-project/runtime/pull/300:

* where possible, use `TransformController`
* use integrated tracker to reduce boilerplate

Sometimes fix/rewrite tests where applicable.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-07-27 17:00:42 +04:00
Andrey Smirnov
786e86f5b8
refactor: rewrite the way Talos acquires the machine configuration
Fixes #7453

The goal is to make it possible to load some multi-doc configuration
from the platform source (or persisted in STATE) before machine acquires
full configuration.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-07-24 14:26:42 +04:00
Andrey Smirnov
9ef4e5efca
fix: log explicitly when kubelet has no nodeIP match
Fixes #7487

When `.kubelet.nodeIP` filters yield no match, Talos should not start
the kubelet, as using empty address list results in `--node-ip=` empty
kubelet arg, which makes kubelet pick up "the first" address.

Instead, skip updating (creating) the nodeIP and log an explicit
warning.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-07-20 00:41:47 +04:00
Andrey Smirnov
6b39c6a4d3
fix: enable compression and bump gRPC max msg size
Fixes #7482

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-07-19 22:46:37 +04:00
Noel Georgi
2f2eca8617
chore: basic support for shutdown/poweroff flags
This adds basic support for shutdown/poweroff flags.
it can distringuish between halt/shutdown/reboot.

In the case of Talos halt/shutdown is same op.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-07-19 23:35:32 +05:30
Noel Georgi
59d7d9344b
chore: use machined for shutdown, poweroff
Use the `machined` socket for `shutdown` and `poweroff` aliases. This
ensures that worker nodes does not have to wait on apid to start.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-07-19 21:48:15 +05:30
Dmitriy Matrenichev
2439bfb719
chore: explicitly add timestamps to machined logs
We can safely do it on `io.Writer` level, since `log.Logger.Output` (called by `Print|Printf`) pretty much promises
that every call to `Write` ends with `\n`.

Closes #7439

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-07-19 18:29:17 +03:00
Noel Georgi
14966e718a
fix: skip over tpm2 1.2 devices
For rng seed and pcr extend, let's ignore if the device is not TPM2.0
based. Seal/Unseal operations would still error out since it's
explicitly user enabled feature.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-07-18 12:58:45 +05:30
Noel Georgi
166d75fe88
fix: tpm2 encrypt/decrypt flow
The previous flow was using TPM PCR 11 values to bound the policy which
means TPM cannot unseal when UKI changes. Now it's fixed to use PCR 7
which is bound to the SecureBoot state (SecureBoot status and
Certificates). This provides a full chain of trust bound to SecureBoot
state and signed PCR signature.

Also the code has been refactored to use PolicyCalculator from the TPM
library.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-07-14 23:58:59 +05:30
Dmitriy Matrenichev
5f34f5b41f
chore: rename api load balancer to KubePrism
Closes #7432

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-07-14 15:23:53 +03:00
Andrey Smirnov
c8b7095c01
refactor: use tpm2 library to calculate policy hash
No real change, just using library to do the work (should be more
readable).

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-07-14 15:07:47 +04:00
Andrey Smirnov
53873b8444
refactor: move ukify into Talos code
This is intemediate step to move parts of the `ukify` down to the main
Talos source tree, and call it from `talosctl` binary.

The next step will be to integrate it into the imager and move `.uki`
build out of the Dockerfile.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-07-13 19:14:32 +04:00
Noel Georgi
79365d9bac
feat: tpm2 based disk encryption
Support disk encryption using tpm2 and pre-calculated signed PCR values.

Fixes: #7266

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-07-12 20:41:28 +05:30
Andrey Smirnov
06369e8195
fix: retry CRI pod removal, fix upgrade flow in the tests
It seems that CRI has a bit of eventual consistency, and it might fail
to remove a stopped pod failing that it's still running.

Rewrite the upgrade API call in the upgrade test to actually wait for
the upgrade to be successful, and fail immediately if it's not
successful. This should improve the test stability and it should make
it easier to find issues immediately.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-07-12 16:20:10 +04:00
Andrey Smirnov
8017afb107
feat: implement CRI image management and pre-pull on K8s upgrade
Fixes #6391

Implement a set of APIs and commands to manage images in the CRI, and
pre-pull images on Kubernetes upgrades.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-07-11 19:25:10 +04:00
Andrey Smirnov
1c2f19b367
feat: update Kubernetes to 1.28.0-alpha.4
The Go modules were not tagged for alpha.4, so using alpha.3 tag.

Talos 1.5 will ship with Kubernetes 1.28.0.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-07-11 15:40:24 +04:00
Artem Chernyshev
936111ce06
fix: properly set up tls for KMS endpoint
The condition was inverted 🤦

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2023-07-10 21:10:02 +03:00
Artem Chernyshev
cb226eec46
fix: rewrite encryption system information flow
Pass getter to the key handler instead of already fetched node uuid.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2023-07-10 19:07:46 +03:00
Andrey Smirnov
bd4f89f633
fix: disable dashboard on Azure, GCP and Scaleway
Fixes #7416

These platforms don't have video console access.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-07-10 17:05:56 +04:00
Andrey Smirnov
bdb96189fa
refactor: make maintenance service controller-based
Fixes #7430

Introduce a set of resources which look similar to other API
implementations: CA, certs, cert SANs, etc.

Introduce a controller which manages the service based on resource
state.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-07-10 15:41:52 +04:00
Andrey Smirnov
d23d04de2a
feat: seed the kernel random pool from the TPM
Use the TPM2 feature to provide high-quality random bytes.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-07-07 23:51:11 +04:00
LukasAuerbeck
c81ce8cfb0
feat: support controlplane resources configuration
Fixes #7379

Add possibility to configure the controlplane static pod resources via
APIServer, ControllerManager and Scheduler configs.

Signed-off-by: LukasAuerbeck <17929465+LukasAuerbeck@users.noreply.github.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-07-07 22:44:56 +04:00
Andrey Smirnov
74de562b29
fix: mount hugepages with nosuid + nodev
Fixes #7445

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-07-07 21:57:19 +04:00
Artem Chernyshev
ce63abb219
feat: add KMS assisted encryption key handler
Talos now supports new type of encryption keys which rely on Sealing/Unsealing randomly generated bytes with a KMS server:

```
systemDiskEncryption:
  ephemeral:
    keys:
      - kms:
          endpoint: https://1.2.3.4:443
        slot: 0
```
gRPC API definitions and a simple reference implementation of the KMS server can be found in this
[repository](https://github.com/siderolabs/kms-client/blob/main/cmd/kms-server/main.go).

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2023-07-07 19:02:39 +03:00
Andrey Smirnov
6be5a13d5d
feat: implement machine config documents for event and log streaming
Fixes #7228

Add some changes to make Talos accept partial machine configuration
without main v1alpha1 config.

With this change, it's possible to connect a machine already running
with machine configuration (v1alpha1), the following patch will connect
to a local SideroLink endpoint:

```yaml
apiVersion: v1alpha1
kind: SideroLinkConfig
apiUrl: grpc://172.20.0.1:4000/?jointoken=foo
---
apiVersion: v1alpha1
kind: KmsgLogConfig
name: apiSink
url: tcp://[fdae:41e4:649b:9303::1]:4001/
---
apiVersion: v1alpha1
kind: EventSinkConfig
endpoint: "[fdae:41e4:649b:9303::1]:8080"
```

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-07-01 00:22:44 +04:00
James Callahan
c02ada7d95
fix: capabilities including ALL should be uppercase
Pod security standard requires that ALL is in caps

Signed-off-by: James Callahan <james@wavesquid.com>
Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-06-29 12:58:30 +05:30
Noel Georgi
cbdf96d461
feat: support environment file for extensions
Supports setting `environmentFile` for Talos System Extension Services.

Fixes: #7316

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-06-28 00:21:13 +05:30