1853 Commits

Author SHA1 Message Date
Andrey Smirnov
79bbdf454e
fix: set proper timeouts for KubePrism loadbalancer
The default timeouts are very aggressive, and we should use explicit
timeouts so that healh checks don't run that often.

Fixes #7690

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-09-01 00:16:09 +04:00
Andrey Smirnov
b8fb55d5c2
fix: use a mount prefix when installing a bootloader
This is not a problem in general, but when running multiple image
generation procedures using the same mount point is a problem.

This is a no-op if `MountPrefix` is not set (when installing/upgrading
vs. creating an image).

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-08-31 22:21:41 +04:00
Andrey Smirnov
2d3ac925ea
refactor: update NTP spike detector
See https://github.com/siderolabs/talos/issues/7080#issuecomment-1696105986

The NTP spike detector code was refactored out of the main NTP code so
that it can be unit-tested.

I dropped one check which I think is causing false-positives in the
spike detector (when NTP offset is higher than the RTT of the best
packet received so far).

The overall flow resembles the one in systemd-timesync, the current
implementation has this check:

6639ac474e/src/timesync/timesyncd-manager.c (L357-L360)

This check was introduced in the initial release, after some
refactoring:

3dbc762003 (diff-4aa9995f07bb31b9884d40a7634f5f6d30245dfd26ac27b89cd5fd3bd4eef56aR429-R431)

There is no equivalent of it in the RFC:

https://datatracker.ietf.org/doc/html/rfc5905#appendix-A.5.2

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-08-29 20:56:42 +04:00
Noel Georgi
d03dc7a8af
chore: validate new system extensions
Validate the amdgpu and intel-ice firmware extensions.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-08-25 17:18:19 +05:30
Andrey Smirnov
3c9f7a7de6
chore: re-enable nolintlint and typecheck linters
Drop startup/rand.go, as since Go 1.20 `rand.Seed` is done
automatically.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-08-25 01:05:41 +04:00
Andrey Smirnov
8670450d28
release(v1.6.0-alpha.0): prepare release
This is the official v1.6.0-alpha.0 release.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-08-24 17:09:34 +04:00
Noel Georgi
6778ded29d
feat: add e2e-aws for nvidia extensions
Add e2e tests for nvidia

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-08-24 17:43:36 +05:30
Andrey Smirnov
74c07ed714
chore: update Go to 1.21
This fixes a problem in the `RouteSpecController` which is due to a
subtle (but correct) change in the behavior in the `stdlib`.

Also some small (but should be safe) bumps.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-08-23 22:52:04 +04:00
Andrey Smirnov
c0ea4d7ba5
fix: properly calculate overal of node address with subnet filters
Example: host has address `10.0.0.1/8`, while Kubernetes pod CIDR is
`10.244.0.0/16`. These two subnets overlap, but the address `10.0.0.1`
isn't contained in the `10.244.0.0/16` subnet.

This change fixes the check to make sure address is not contained vs.
the address subnet overlaps with the filter.

NB: this is still a bad idea to have host network subnet to overlap with
Kubernetes pod/service CIDRs.

Also refactor the unit-tests to use new (better ways) to do assertions.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-08-23 21:16:58 +04:00
Noel Georgi
d6b2719e2e
chore: drone: move extensions step to a function
Move drone extensions integration to a function. This allows us to
re-use the code and just depend on a single step rather than explicitly
defining all dependencies.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-08-23 20:56:43 +05:30
Andrey Smirnov
9608ef56dc
chore: allow bridge traffic with DHCP broadcast traffic
This is required for https://github.com/siderolabs/sidero/pull/1070, as
we need to allow DHCP traffic from Sidero controller running in a VM
through the bridge to other VMs.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-08-23 18:37:37 +04:00
Noel Georgi
833895940b
chore: add tests for zfs extension
Add tests for ZFS and btrfs extensions.
Also fix the e2e-aws cron pipeline.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-08-23 11:16:25 +05:30
Utku Ozdemir
ea0d6e8c6a
fix: prevent dashboard crashes when process info is not available
Processes and their info are not guaranteed to be present on the api-based data gathered by the dashboard. Therefore, we switch to using nil-safe access to the CPU time when rendering the process table.

Closes siderolabs/talos#7645.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2023-08-22 12:55:36 +02:00
Andrey Smirnov
e9077a6fb9
feat: filter the hostname to produce nodename
Fixes #7615

This extends the previous handling when Talos did `ToLower()` on the
hostname to do the full filtering as expected.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-08-22 12:41:57 +04:00
Andrey Smirnov
dc8361c1d5
fix: properly GC images supplied with both tag and digest
This is a follow-up fix for #7640

I noticed that image cleanup controller cleans up the images if
specified with both tag and digest.

The problem was incorrectly building image references in the expected
set of images, so they were incorrectly marked as unused.

Refactor the code to make the core part testable.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-08-21 21:04:24 +04:00
Andrey Smirnov
b56e8b7d9b
fix: support 'List' type manifests
Fixes #7636

This support a `List`-type manifests by unwrapping them into individual
objects.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-08-21 16:48:37 +04:00
Andrey Smirnov
574d48e540
fix: use image digest when starting a container
First of all, it seems to be "right way", as it makes sure the image is
looked up by the digest.

Second, it fixes the case when image is specified with both tag and
digest (which is not supposed to be the correct ref, but it is used
frequently).

Talos since 1.5.0 stores images with the following aliases:

```
gcr.io/etcd-development/etcd:v3.5.9
gcr.io/etcd-development/etcd@sha256:8c956d9b0d39745fa574bb4dbacd362ffdc1109479432f54094859d4cf984b17
ghcr.io/siderolabs/kubelet:v1.28.0
ghcr.io/siderolabs/kubelet@sha256:50710f2cd3328c23f57dfc7fb00940d8cfd402315e33fc7cb8184fc660650a5c
sha256:50710f2cd3328c23f57dfc7fb00940d8cfd402315e33fc7cb8184fc660650a5c
sha256:8c956d9b0d39745fa574bb4dbacd362ffdc1109479432f54094859d4cf984b17
```

This change pulls the digest format (the last in this list) and uses it
to start a container.

Fixes #7640

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-08-21 15:48:59 +04:00
Noel Georgi
6b0373ebef
chore: move bash tests to integration
move extensions and secureboot tests to integration.
Makes it easier to test.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-08-17 19:58:35 +05:30
Andrey Smirnov
86c94eff8d
refactor: docgen and config examples
Short version is: move from global variables/`init()` function into
explicit functions.

`docgen` was updated to skip creating any top-level global variables,
now `Doc` information is generated on the fly when it is accessed.
Talos itself doesn't marshal the configuration often, so in general it
should never be accessed for Talos (but will be accessed e.g. for
`talosctl`).

Machine config examples were changed manually from variables to
functions returning a value and moved to a separate file.

There are no changes to the output of `talosctl gen config`.

There is a small change to the generated documentation, which I believe
is a correct one, as previously due to value reuse it was clobbered with
other data.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-08-10 14:56:01 +04:00
Andrey Smirnov
ee6d639f6c
fix: match routes on the priority properly
Fixes #7592

The problem was a mismatch between a "primary key" (ID) of the
`RouteSpec` and the way routes are looked up in the kernel - with two
idential routes but different priority Talos would end up in an infinite
loop fighting to remove and re-add back same route, as priority never
matches.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-08-10 14:29:47 +04:00
Dmitriy Matrenichev
c4a1ca8d61
chore: remove <-errCh where possible in grpc methods
Simplify code by passing error directly into the pipe closer.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-08-07 22:28:58 +03:00
Andrey Smirnov
e0f383598e
chore: clean up the output of the imager
Use `Progress`, and options to pass around the way messages are written.

Fixed some tiny issues in the code, but otherwise no functional changes.

To make colored output work with `docker run`, switched back image
generation to use volume mount for output (old mode is still
functioning, but it's not the default, and it works when docker is not
running on the same host).

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-08-07 16:00:14 +04:00
Andrey Smirnov
fb536af4d1
chore: optimize memory usage of tcell library on init
There are two changes here:

* build `machined` binary with `tcell_minimal` tag (which disables
  loading some parts of the terminfo database), which also affects
  `apid`, `trustd` and `dashboard` processes, as they run from the same
  executable; in `dashboard` explicitly import `linux` terminal we're
  using when the `dashboard` runs on the machine
* pass `TCELL_MINIMIZE=1` environment variable to each Talos process
  which removes 0.5MiB of runewdith allocation for a lookup table

See #7578

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-08-04 17:59:18 +04:00
Artem Chernyshev
7d688ccfeb
fix: make encryption config provider default to luks2 if not set
Fixes: https://github.com/siderolabs/talos/issues/7515

Rename `Kind` to `Provider` in the `v1alpha1_provider`.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2023-08-04 12:20:55 +03:00
Dmitriy Matrenichev
80238a05a6
chore: unify semver under github.com/blang/semver/v4
Currently, we use `github.com/coreos/go-semver/semver` and `github.com/hashicorp/go-version`
for version parsing. As we use `github.com/blang/semver/v4` in our other projects, and it
has more features, it makes sense to use it across the projects. It also doesn't allocate
like crazy in `KubernetesVersion.SupportedWith`.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-08-04 00:29:52 +03:00
Andrey Smirnov
0f1920bdda
chore: provide a resource to peek into Linux clock adjustments
This is a follow-up for #7567, which won't be backported to 1.5.

This allows to get an output like:

```
$ talosctl -n 172.20.0.5 get adjtimestatus -w
NODE         *   NAMESPACE TYPE            ID     VERSION   OFFSET        ESTERROR   MAXERROR   STATUS               SYNC
172.20.0.5   +   runtime   AdjtimeStatus   node   47        -18.14306ms   0s         191.5ms    STA_PLL | STA_NANO   true
172.20.0.5       runtime   AdjtimeStatus   node   48        -17.109555ms  0s         206.5ms    STA_NANO | STA_PLL   true
172.20.0.5       runtime   AdjtimeStatus   node   49        -16.134923ms  0s         221.5ms    STA_NANO | STA_PLL   true
172.20.0.5       runtime   AdjtimeStatus   node   50        -15.21581ms   0s         236.5ms    STA_PLL | STA_NANO   true
```

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-08-03 22:06:53 +04:00
Andrey Smirnov
4eab3017b0
fix: calculate log2i properly
Fixes #7080

The real bug was off-by-one in `log2i` implementation, other changes are
cleanups as `x/sys/unix` package now contains all the constants we need.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-08-03 21:17:58 +04:00
Jared Davenport
bcf2845307
fix: update providerid prefix for aws
This PR updates the ProviderID format for aws resources. There seems to
be a bug when using Talos CCM (which consumes this value from Talos)
because the format is `aws://x/y` (two slashes) vs. the expected
`aws:///x/y` (three slashes) that is set with the AWS CCM code
[here](d055109367/pkg/providers/v1/instances.go (L47-L53)).

Setting only two slashes causes important software in the workload
cluster to fail, specifically cluster-autoscaler. The regex they use for
pulling providerID is [here](702e9685d6/cluster-autoscaler/cloudprovider/aws/aws_cloud_provider.go (L195)).

Signed-off-by: Spencer Smith <spencer.smith@talos-systems.com>
2023-08-03 10:21:56 -04:00
Andrey Smirnov
793dcedc95
fix: fast-wipe the system disk on talosctl reset
Fixes #7558

I see no reason to keep old behavior (removing all partitions on the
disk), as it's only compatible with Talos itself.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-08-03 16:28:59 +04:00
Andrey Smirnov
87fe8f1a2a
feat: implement image generation profiles
Support full configuration for image generation, including image
outputs, support most features (where applicable) for all image output
types, unify image generation process.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-08-02 19:13:44 +04:00
Andrei Kvapil
10f958cf41
feat: network configuration improvements on the NoCloud platform
* support for bonding
* added interface selection by MAC address
* added routes management

Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-08-02 15:03:33 +04:00
Dmitriy Matrenichev
abf3831174
chore: remove cpu_manager_state on cpuManagerPolicy change
After we closed `kubelet`, remove `/var/lib/kubelet/cpu_manager_state` if there are any changes in `cpuManagerPolicy`.
We do not add any other safeguards, so it's user responsibility to cordon/drain the node in advance.

Also minor fixes in other files.

Closes #7504

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-08-01 18:53:04 +03:00
Noel Georgi
68e6b98f7d
feat: add security state resource
Add security state resource that describes the state of Talos SecureBoot
and PCR signing key fingerprints.

The UKI fingerprint is currently not populated.

Fixes: #7514

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-07-31 22:02:08 +05:30
Andrey Smirnov
a17272cdda
chore: update hcloud API SDK to v2
There are no functional changes, but SDK got updated to handle int ->
int64 changes. v1 version is only supported to Sep 2023.

See https://github.com/hetznercloud/hcloud-go#support

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-07-28 19:00:10 +04:00
Andrey Smirnov
6d71bb8df2
refactor: replace google/gopacket with gopacket/gopacket
This new fork seems to be more active. The change itself doesn't fix any
memory allocation, but I submitted a PR for gopacket/gopacket:

https://github.com/gopacket/gopacket/pull/24

Also fix crazy alloc in `tui/components` (this is only relevant for
`talosctl`).

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-07-28 17:34:15 +04:00
Andrey Smirnov
846f37d84c
refactor: drop dependency on vmware/govmomi
This module was imported just for a single Go struct (for XML
unmarshalling), and it could be easily internalized.

The module causes significant allocation on startup:

```
init github.com/vmware/govmomi/vim25/types @23 ms, 1.4 ms clock, 1269864 bytes, 196 allocs
```

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-07-28 16:49:34 +04:00
Andrey Smirnov
ca0b32c514
refactor: update AWS SDK and http-getter to v2 versions
Both are much modular and pull in much less dependendencies in to the
Talos tree.

This solves the problem with allocations in AWS endpoints on import, and
removes a bunch of dependencies.

Raw binary size: -10 MiB.

Memory usage (not scientific): around -5 MiB for all Talos services.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-07-28 15:30:02 +04:00
Noel Georgi
a3a2aa8ef3
fix: use fast wipe for upgrade
As part of bootloader refactoring `go-blockdevice` was used for wiping
partitions in #7329, but used standard wipe which could be fast/slow
depending on the blockdevice support. Switch to using fast-wipe for
partitions. This should not affect `wipe` option in machineconfig.

Fixes: #7531

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-07-27 21:09:06 +05:30
Utku Ozdemir
355681ddab
fix: terminate dashboard gracefully on & switch back to tty1
- Make dashboard SIGTERM-aware
- Handle panics on dashboard and terminate it gracefully, so it resets the terminal properly
- Switch to TTY2 when it starts and back to TTY1 when it stops.

Closes siderolabs/talos#7516.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2023-07-27 16:00:23 +02:00
Andrey Smirnov
544cb4fe7d
refactor: accept partial machine configuration
This refactors code to handle partial machine config - only multi-doc
without v1alpha1 config.

This uses improvements from
https://github.com/cosi-project/runtime/pull/300:

* where possible, use `TransformController`
* use integrated tracker to reduce boilerplate

Sometimes fix/rewrite tests where applicable.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-07-27 17:00:42 +04:00
Andrey Smirnov
786e86f5b8
refactor: rewrite the way Talos acquires the machine configuration
Fixes #7453

The goal is to make it possible to load some multi-doc configuration
from the platform source (or persisted in STATE) before machine acquires
full configuration.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-07-24 14:26:42 +04:00
Andrey Smirnov
9ef4e5efca
fix: log explicitly when kubelet has no nodeIP match
Fixes #7487

When `.kubelet.nodeIP` filters yield no match, Talos should not start
the kubelet, as using empty address list results in `--node-ip=` empty
kubelet arg, which makes kubelet pick up "the first" address.

Instead, skip updating (creating) the nodeIP and log an explicit
warning.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-07-20 00:41:47 +04:00
Andrey Smirnov
6b39c6a4d3
fix: enable compression and bump gRPC max msg size
Fixes #7482

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-07-19 22:46:37 +04:00
Noel Georgi
2f2eca8617
chore: basic support for shutdown/poweroff flags
This adds basic support for shutdown/poweroff flags.
it can distringuish between halt/shutdown/reboot.

In the case of Talos halt/shutdown is same op.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-07-19 23:35:32 +05:30
Noel Georgi
59d7d9344b
chore: use machined for shutdown, poweroff
Use the `machined` socket for `shutdown` and `poweroff` aliases. This
ensures that worker nodes does not have to wait on apid to start.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-07-19 21:48:15 +05:30
Dmitriy Matrenichev
2439bfb719
chore: explicitly add timestamps to machined logs
We can safely do it on `io.Writer` level, since `log.Logger.Output` (called by `Print|Printf`) pretty much promises
that every call to `Write` ends with `\n`.

Closes #7439

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-07-19 18:29:17 +03:00
Noel Georgi
14966e718a
fix: skip over tpm2 1.2 devices
For rng seed and pcr extend, let's ignore if the device is not TPM2.0
based. Seal/Unseal operations would still error out since it's
explicitly user enabled feature.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-07-18 12:58:45 +05:30
Noel Georgi
166d75fe88
fix: tpm2 encrypt/decrypt flow
The previous flow was using TPM PCR 11 values to bound the policy which
means TPM cannot unseal when UKI changes. Now it's fixed to use PCR 7
which is bound to the SecureBoot state (SecureBoot status and
Certificates). This provides a full chain of trust bound to SecureBoot
state and signed PCR signature.

Also the code has been refactored to use PolicyCalculator from the TPM
library.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-07-14 23:58:59 +05:30
Dmitriy Matrenichev
5f34f5b41f
chore: rename api load balancer to KubePrism
Closes #7432

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-07-14 15:23:53 +03:00
Andrey Smirnov
c8b7095c01
refactor: use tpm2 library to calculate policy hash
No real change, just using library to do the work (should be more
readable).

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-07-14 15:07:47 +04:00