2047 Commits

Author SHA1 Message Date
Noel Georgi
1207054599
chore: handle I/O error for xfs_repair
Run `xfs_repair` on `unix.EIO` error.

```text
16T18:19:30.85674118Z]: XFS (sdb5): Mounting V5 Filesystem
109.200.197.196: kern:    info: [2024-04-16T18:19:30.92421418Z]: XFS (sdb5): Ending clean mount
109.200.197.196: kern:  notice: [2024-04-16T18:19:36.42651618Z]: XFS (sdb6): Mounting V5 Filesystem
109.200.197.196: kern:    info: [2024-04-16T18:19:36.49568918Z]: XFS (sdb6): Ending clean mount
109.200.197.196: kern:  notice: [2024-04-16T18:19:36.54484918Z]: XFS (sdb6): Quotacheck needed: Please wait.
109.200.197.196: kern:  notice: [2024-04-16T18:19:36.54586418Z]: XFS (sdb6): Quotacheck: Done.
109.200.197.196: kern:   alert: [2024-05-13T15:13:11.99476118Z]: XFS (sdb6): log I/O error -5
109.200.197.196: kern:   alert: [2024-05-13T15:13:11.99477118Z]: XFS (sdb6): Filesystem has been shut down due to log error (0x2).
109.200.197.196: kern:   alert: [2024-05-13T15:13:11.99477318Z]: XFS (sdb6): Please unmount the filesystem and rectify the problem(s).
```

Signed-off-by: Noel Georgi <git@frezbo.dev>
2024-05-13 21:19:50 +05:30
Andrey Smirnov
b7afe2669b
feat: update Linux 6.6.30
Update tools/pkgs to the latest version, brings in all updates.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-05-13 17:14:03 +04:00
Andrey Smirnov
1d29111d43
chore: update Go to 1.22.3
Also bump dependencies.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-05-08 14:59:41 +04:00
Serge Logvinov
f4d7b9d9a9
feat: gather platform dns names
Retrieve the DNS names of instances from the platform metadata.

Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-05-08 00:11:24 +04:00
Andrey Smirnov
763dae2508
fix: add cluster name to the worker machine config
This is 1.8+ only.

Fixes #8694

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-05-07 20:11:23 +04:00
Andrew Rynhard
4aac5b4ec3
feat: mount /sys/kernel/security into kubelet
This allows the kubelet to detect AppArmor.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-05-07 19:12:06 +04:00
Andrey Smirnov
07f78182c6
fix: use a fresh context for etcd unlock
By the time unlock is called, context might be already canceled.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-05-01 18:59:50 +04:00
Andrey Smirnov
b690ffeb89
test: improve DNS resolver test stability
Run a health check before the test, as the test depends on CoreDNS being
healthy, and previous tests might disturb the cluster.

Also refactor by using watch instead of retries, make pods terminate
fast.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-04-29 19:31:34 +04:00
Birger J. Nordølum
5aa0299b6e
style: use correct capitalization for openstack
OpenStack is currently not capitalized correctly. "Stack" should be
written with a capital S: OpenStack, not Openstack.

Signed-off-by: Birger J. Nordølum <contact@mindtooth.no>
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-04-29 18:46:06 +04:00
Andrey Smirnov
4c0c626b78
feat: use zstd compression in place of xz
Initramfs and kernel are compressed with zstd.

Extensions are compressed with zstd for Talos 1.8+.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-04-29 18:09:12 +04:00
Andrey Smirnov
98906ed6ea
fix: use reboot delay only in case of error
Delay the reboot for 10 seconds only if Talos hits an error, but
otherwise just proceed with the requested action.

This removes 10 seconds on "regular" reboot without kexec.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-04-26 18:46:00 +04:00
Andrey Smirnov
05fd042bb3
test: improve the reset integration tests
Provide a trace for each step of the reset sequence taken, so if one of
those fails, integration test produces a meaningful message instead of
proceeding and failing somewhere else.

More cleanup/refactor, should be functionally equivalent.

Fixes #8635

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-04-24 18:35:39 +04:00
Dmitriy Matrenichev
ccdb4c8b10
chore: update google.golang.org/grpc to 1.63.2
Update other modules while we are at it.

Closes #8628

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2024-04-23 16:39:28 +03:00
Andrey Smirnov
c5b59df697
fix: wait for devices to be discovered before probing filesystems
With Talos 1.7+, more storage drivers are split as modules, so the
devices might not be discovered by the time platform config is going to
be loaded. Explicitly wait for udevd to settle down before trying to
probe a CD.

Fixes #8625

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-04-23 16:42:40 +04:00
Andrey Smirnov
2bf613ad3b
fix: add endpoints for "virtual" host-dns service
Without endpoints, `kube-proxy` adds an automatic reject rule for the
service, which breaks host network namespace DNS resolving with
`forwardKubeDNSToHost: true`.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-04-22 21:26:44 +04:00
Andrey Smirnov
f4163aefed
fix: bump priority of OpenStack routes if IPv6 and default gateway
It looks like the gateway is sometimes reported as a 'route', skipping the
'gateway' field.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-04-22 20:40:50 +04:00
Dmitry Sharshakov
6fbd1263cc
feat: report process MAC labels
This will be useful for debugging process access rights once we start implementing SELinux.

Signed-off-by: Dmitry Sharshakov <dmitry.sharshakov@siderolabs.com>
2024-04-22 18:16:33 +03:00
Andrey Smirnov
bac1d00c35
chore: prepare for Talos 1.8
Fork docs, introduce version contract for 1.8.

Clean up old version contracts 0.8-0.14.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-04-19 18:19:36 +04:00
Dmitriy Matrenichev
908f67fa15
feat: add host dns support for resolving member addrs
Closes #8330

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2024-04-18 15:29:30 +03:00
Dmitriy Matrenichev
ec69d7a785
chore: replace math/rand with math/rand/v2
New package arrived in Go 1.22 which provides better rand primitives and functions.
Use it instead of the old one.
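
A small illustration of the v2 API, requiring Go 1.22 or later. `pick` is a hypothetical helper written for this example. In `math/rand/v2` the top-level functions are seeded randomly and `Intn` is spelled `IntN`.

```go
package main

import (
	"fmt"
	"math/rand/v2"
)

// pick returns a uniformly random element of a non-empty slice.
func pick[T any](items []T) T {
	// rand.IntN replaces rand.Intn from the old math/rand package.
	return items[rand.IntN(len(items))]
}

func main() {
	fmt.Println(pick([]string{"alpha", "beta", "gamma"}))
}
```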

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2024-04-18 13:20:59 +03:00
Andrey Smirnov
3433fa13bf
feat: use container DNS when in container mode
More specifically, pick up `/etc/resolv.conf` contents by default when
in container mode, and use that as a base resolver for the host DNS.

Fixes #8303

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-04-16 17:01:36 +04:00
Andrey Smirnov
5d07ac5a7d
fix: close apid inter-backend connections gracefully for real
Fixes #8552

This fixes up the previous fix where `for` condition was inverted, and
also updates the idle timeout, so that the transition to idle happens
before the timeout expires.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-04-16 16:21:34 +04:00
Artem Chernyshev
3dd1f4e88c
chore: extract pkg/imager/quirks to pkg/machinery
To make it possible to use it without pulling the whole Talos.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2024-04-15 21:37:47 +03:00
Andrey Smirnov
bfbd02abfb
fix: assign different priority to IPv6 default gateway on OpenStack
Fixes #8558

A similar fix was done for other platforms, but not for OpenStack.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-04-10 21:02:13 +04:00
Andrey Smirnov
c8f674bd3d
test: add a test for 'spin' container runtime
See https://github.com/siderolabs/extensions/pull/355

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-04-10 20:42:16 +04:00
Dmitriy Matrenichev
5390ccd48c
chore: replace []byte with string and use go:embed for templates
Optimize code a bit.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2024-04-10 17:47:43 +03:00
Dmitriy Matrenichev
ba7cdc8c8b
chore: optimize DNSResolveCacheController
Optimize `DNSResolveCacheController` type, including `dns.Server` optimization for easy start/stop. This PR ensures that we
delete server from runners on stop (even unexpected) and restart it properly. Also fixes incorrect assumption on unit-tests.

Fixes #8563

This PR also does the following:
- Removes `utils.Runner`
- Removes `ctxutil.MonitorFn`
- Removes `dns.Runner`
- Removes `network.dnsRunner`

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2024-04-10 17:24:19 +03:00
Utku Ozdemir
3735add87c
fix: reconnect to the logs stream in dashboard after reboot
The log stream displayed in the dashboard stopped working when a node was rebooted.
Rework the log data source to establish a per-node connection and use a retry loop to always reconnect until the dashboard is terminated.

Print the connection errors in the log stream in red color.

Closes siderolabs/talos#8388.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2024-04-10 10:43:45 +02:00
Andrey Smirnov
9aa1e1b79b
fix: present all accepted CAs to the kube-apiserver
This fixes an issue with a single controlplane cluster.

Properly present all accepted CAs to the apiserver; in the test, let the
cluster fully recover between the two CA rotations performed.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-04-08 23:33:22 +04:00
Andrey Smirnov
336e611746
fix: close the apid connection to other machines gracefully
Fixes #8552

When `apid` notices update in the PKI, it flushes its client connections
to other machines (used for proxying), as it might need to use new
client certificate.

While flushing, just calling `Close` might abort already running
connections.

So instead, try to close gracefully with a timeout when the connection
is idle.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-04-08 19:47:04 +04:00
Andrey Smirnov
ff2c427b04
fix: pre-create nftables chain to make kubelet use nftables
In Talos, kubelet (and kube-proxy) images use `iptables-wrapper` script
to detect which version of `iptables` (legacy or NFT) to use.

The script assumes that `kubelet` runs on the host, and uses whatever
version of `iptables` which is being used by the host. In Talos,
`kubelet` runs in a container which has same `iptables-wrapper` script,
and it defaults to `legacy` mode in our case.

We can't change the `kubelet` image, as it would affect all Talos
versions, so instead pre-create the chains/tables in `nftables` so that
kubelet will pick up `nft` version of `iptables`, and `kube-proxy` will
do the same.

Without this fix, the problem arises from the mix of `nft` used by Talos
for the firewall and Kubernetes world relying on `legacy` (`xtables`).

Fixes https://github.com/siderolabs/kubelet/issues/77

See e139a11535/iptables-wrapper-installer.sh (L102-L130)

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-04-08 16:24:42 +04:00
Dmitriy Matrenichev
01d8b897c4
fix: make safeReset truly safe to call multiple times
Reading documentation is important: `timer.Stop()` explicitly says that it returns false if the timer has
already expired *OR* has already been stopped. The previous version of the code would block forever, and because of
that the tunnel relay never started.

The new version takes that into account.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2024-04-05 00:34:17 +03:00
Dmitry Sharshakov
653f838b09
feat: support multiple Docker clusters in talosctl cluster create
Dynamically map Kubernetes and Talos API ports to an available port on
the host, so every cluster gets its own unique set of ports.
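
One common way to find an available host port is to bind to port 0 and let the kernel choose. The sketch below shows that technique only; `freePort` is a hypothetical helper, and whether the provision library picks ports this way is an assumption.

```go
package main

import (
	"fmt"
	"net"
)

// freePort asks the kernel for an unused TCP port on the loopback interface.
func freePort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()

	return l.Addr().(*net.TCPAddr).Port, nil
}

func main() {
	port, err := freePort()
	if err != nil {
		fmt.Println("failed to find a port:", err)
		return
	}

	fmt.Println("mapping API to host port", port)
}
```

The port is released before it is used, so another process could claim it in between; a caller should be ready to retry.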

As part of the changes, refactor the provision library and interfaces,
dropping old weird interfaces and replacing them with (hopefully) much more
descriptive names.

Signed-off-by: Dmitry Sharshakov <dmitry.sharshakov@siderolabs.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-04-04 21:21:39 +04:00
Andrey Smirnov
951904554e
chore: bump dependencies (go 1.22.2)
Update Go to 1.22.2, update Go modules to resolve
[HTTP/2 issue](https://www.kb.cert.org/vuls/id/421644).

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-04-04 14:59:24 +04:00
Andrey Smirnov
862c76001b
feat: add support for CoreDNS forwarding to host DNS
This PR adds support for CoreDNS forwarding to host DNS. We try to bind to the 9th address of the first element of
`serviceSubnets` and create a simple service so that k8s will not attempt to rebind it.
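
The address selection can be sketched as below, assuming "9th address" means the network address plus nine. `nthAddr` is a hypothetical helper, and `10.96.0.0/12` is only an example subnet.

```go
package main

import (
	"fmt"
	"net/netip"
)

// nthAddr returns the address at offset n from the start of the subnet.
func nthAddr(subnet string, n int) (netip.Addr, error) {
	prefix, err := netip.ParsePrefix(subnet)
	if err != nil {
		return netip.Addr{}, err
	}

	// Start from the network address and step forward n times.
	addr := prefix.Masked().Addr()
	for i := 0; i < n; i++ {
		addr = addr.Next()
	}

	if !addr.IsValid() || !prefix.Contains(addr) {
		return netip.Addr{}, fmt.Errorf("subnet %s has no address at offset %d", subnet, n)
	}

	return addr, nil
}

func main() {
	// Example value only; the first serviceSubnets entry would be used instead.
	addr, err := nthAddr("10.96.0.0/12", 9)
	fmt.Println(addr, err)
}
```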

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Co-authored-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2024-04-03 23:36:17 +03:00
Evan Johnson
e8ae5ef63a
feat: add akamai platform support
Add support for the Akamai (Linode) platform.

Signed-off-by: Evan Johnson <ejohnson@akamai.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-04-03 19:50:42 +04:00
Andrey Smirnov
5c0f74b377
fix: don't announce the VIP on acquire failure
I noticed that while looking at #8493, but I don't know if this problem
actually happened in real life.

If acquiring a VIP fails (which can only fail for Equinix/HCloud, not L2
ARP announce), we should not set the leader flag, as it would make the
controller announce the IP, while it shouldn't do that.

If this call fails, there's no matching call to de-announce on failure.

The bug would show up as two nodes having same VIP assigned on the host.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-04-03 18:04:44 +04:00
Andrey Smirnov
1b17008e9d
fix: handle more OpenStack link types
Fixes #8481

The issue was that the link 'bridge' was skipped, so Talos default was
applied to run DHCP and use the DHCP hostname (instead of using
platform's hostname).

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-04-03 16:54:36 +04:00
Andrey Smirnov
e7d8041404
fix: always update firewall rules (kubespan)
Fixes #8498

Before KubeSpan was reimplemented to use resources for firewall rules,
the update was happening always, but it got moved to a wrong section of
the controller which gets executed on resource updates, but ignores
updates of the peer statuses.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-04-03 16:33:16 +04:00
Andrey Smirnov
78b9bd9273
fix: report unsupported x86_64 microarchitecture level
Fixes #8361

Talos requires v2 (circa 2008), but VMs are often configured to limit
the exposed features to the baseline (v1).

```
[    0.779218] [talos] [initramfs] booting Talos v1.7.0-alpha.1-35-gef5bbe728-dirty
[    0.779806] [talos] [initramfs] CPU: QEMU Virtual CPU version 2.5+, 4 core(s), 1 thread(s) per core
[    0.780529] [talos] [initramfs] x86_64 microarchitecture level: 1
[    0.781018] [talos] [initramfs] it might be that the VM is configured with an older CPU model, please check the VM configuration
[    0.782346] [talos] [initramfs] x86_64 microarchitecture level 2 or higher is required, halting
```
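
A rough sketch of a v2 level check, assuming CPU flags are given in the `/proc/cpuinfo` naming. The flag list follows the published x86-64-v2 definition; `atLeastV2` is a hypothetical helper and is not how Talos performs the check.

```go
package main

import (
	"fmt"
	"strings"
)

// v2Flags lists the features x86-64-v2 adds on top of the baseline,
// using /proc/cpuinfo flag names (pni is SSE3).
var v2Flags = []string{"cx16", "lahf_lm", "popcnt", "pni", "sse4_1", "sse4_2", "ssse3"}

// atLeastV2 reports whether a space-separated flag list covers level 2.
// Assumption: baseline (v1) features are present on any x86_64 CPU.
func atLeastV2(flags string) bool {
	have := make(map[string]bool)
	for _, f := range strings.Fields(flags) {
		have[f] = true
	}

	for _, f := range v2Flags {
		if !have[f] {
			return false
		}
	}

	return true
}

func main() {
	// Hypothetical flag line of a VM limited to the baseline feature set.
	baseline := "fpu cx8 cmov mmx fxsr sse sse2 syscall lm"

	if !atLeastV2(baseline) {
		fmt.Println("x86_64 microarchitecture level 2 or higher is required")
	}
}
```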

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-04-03 16:09:57 +04:00
Dmitriy Matrenichev
71d90ba5f3
fix: retry in the fixed amount of time if grpc relay failed
Before this commit, if the tunnel failed with an error, it would never restart until a `siderolink.TunnelType` event happened.
Most of the time this is a good idea, because it might mean that the destination has changed.

But the tunnel can also fail because the allowed peer list is not yet loaded on a newly started Omni instance.

Because of that, we want to try again and not be tied to the runtime event channel.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2024-04-03 14:03:42 +03:00
Andrey Smirnov
3195e5d15c
fix: force Flannel CNI to use KubePrism Kubernetes API endpoint
Fixes #8501

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-04-02 22:01:05 +04:00
Noel Georgi
f515741b52
chore: add equinix e2e-tests
Add equinix e2e-tests.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2024-04-02 17:16:59 +05:30
Andrey Smirnov
117e60583d
feat: add support for static extra fields for JSON logs
Fixes #7356

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-04-02 15:15:14 +04:00
Andrey Smirnov
090143b030
fix: allow platform cmdline args to be platform-specific
Fix Equinix Metal (where proper arm64 args are known) and metal platform
(using generic arm64 console arg).

Other platforms might need to be updated, but correct settings are not
known at the moment.

Fixes #8529

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-04-02 14:41:39 +04:00
Andrey Smirnov
7a68504b6b
feat: support rotating Kubernetes CA
Fixes #8440

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-04-01 22:08:02 +04:00
Dmitriy Matrenichev
8dc4910c48
chore: enable "WG over GRPC" testing in siderolink agent tests
Fixes https://github.com/siderolabs/talos/issues/8514
For https://github.com/siderolabs/talos/issues/8392

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2024-04-01 18:24:57 +03:00
Dmitry Sharshakov
9456489147
feat: support hardware watchdog timers
Only enabled when activated by config, disabled on shutdown/reboot

Fixes #8284

Signed-off-by: Dmitry Sharshakov <dmitry.sharshakov@siderolabs.com>
Signed-off-by: Dmitry Sharshakov <d3dx12.xx@gmail.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-03-25 18:19:39 +03:00
Dmitriy Matrenichev
949ad11a2d
chore: import siderolink as siderolink-launch subcommand
This PR ensures that we can test our siderolink communication using embedded siderolink-agent.
If `--with-siderolink` is provided during `talosctl cluster create`, talosctl will embed the proper kernel string and set up `siderolink-agent` as a separate process. It should be used in combination with `--skip-injecting-config` and `--with-apply-config` (the latter will use newly generated IPv6 siderolink addresses which talosctl passes to the agent as a "pre-bind").

Fixes #8392

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2024-03-23 16:08:56 +03:00
Andrey Smirnov
8eacc4ba80
feat: support rotation of Talos API CA
This allows rolling all nodes over to a new CA, either to refresh it or, e.g.,
when the `talosconfig` was exposed accidentally.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-03-22 12:16:47 +04:00