3963 Commits

Author SHA1 Message Date
Noel Georgi
5e9d836c3d
chore: add kernel module signtaure verification
Add kernel module signature verification for out of tree kernel modules.

Fixes: #7049

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-04-10 20:05:07 +05:30
Andrey Smirnov
3cd1c6bb0b
fix: send 'STOP' event on phase end
Previously 'START' was sent for both start and finish.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-04-10 17:56:56 +04:00
Andrey Smirnov
5176d27dc5
feat: update Kubernetes to 1.27.0-rc.1
This has a fix for an issue for DaemonSets and graceful shutdown.

See https://github.com/kubernetes/kubernetes/releases/tag/v1.27.0-rc.1

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-04-07 12:58:37 +04:00
Andrey Smirnov
2c55550a66
fix: quote ISO kernel args for GRUB
Use GRUB quoting function to the kernel args passed to Talos.

This fixes passing `${variable}` to `talos.config=` kernel argument.

Also fix a problem with `ONBUILD` being exected for `imager` image.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-04-07 12:29:49 +04:00
Utku Ozdemir
319d76e389
fix: respect BROWSER=echo in client auth interceptor
Do not attempt to open a browser if `BROWSER` env var is set to be `echo`. The browser library does not always to respect that variable.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2023-04-06 21:47:09 +02:00
Andrey Smirnov
4e4ace839c
chore: update Go to 1.20.3
See https://go.dev/doc/devel/release#go1.20

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-04-05 21:52:53 +04:00
Andrey Smirnov
170f73899a
fix: correctly parse static pod phase
The problem was that 'Succeeded' pod was treated as 'not ready', so that
`MachineStatus` never reached readiness state.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-04-05 18:08:20 +04:00
Utku Ozdemir
c3a595d5b7
fix: improve action tracking post checks
In the tracking of the `reset --reboot`, `reboot` and `upgrade` lifecycle commands, verify that the node was actually rebooted in the post check by comparing the pre- and post-check boot IDs.

In the `reset --reboot` post-check, try both maintenance and normal mode, since the reset might be issued to only remove `EPHEMERAL` partition, which will not put the node into the maintenance mode.

Fixes siderolabs/talos#7009.

Additionally, if an action tracking fails, return the error instead of swallowing it. This way the command erminates with a non-zero exit code. Suppress the re-printing this error after the command was run.

Fixes siderolabs/talos#6966.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2023-04-04 15:28:33 +02:00
Andrey Smirnov
eb01edbc8a
fix: rework DHCP flow
Fixes #7041

Rework the DHCP flow so that we don't use `INFORM` requests anymore. The
idea is to try requesting a hostname from the DHCP server first, and if
the hostname is not send, or it gets overridden in Talos, restart the
DHCP sequence sending the hostname to the DHCP server.

This still avoids sending and requesting a hostname in one request.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-04-03 17:30:30 +04:00
Andrey Smirnov
e095150a6e
test: bump CAPI components versions
Bringing up to the latest version in 1.x series.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-03-31 22:29:08 +04:00
Andrey Smirnov
b898081749
release(v1.4.0-alpha.4): prepare release
This is the official v1.4.0-alpha.4 release.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-03-31 21:19:02 +04:00
Thomas Way
7ffabe0f14
feat: support network bond device selectors
Fixes https://github.com/siderolabs/talos/issues/6756

Signed-off-by: Thomas Way <thomas@6f.io>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-03-31 20:29:20 +04:00
Utku Ozdemir
cbab12e3a1
refactor: rename outbound to connectivity on dashboard
Rename to be consistent between the `networkstatus` resource and the dashboard.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2023-03-31 15:31:35 +02:00
Artem Chernyshev
07c3c5d59e
feat: return disk subsystem in the Disks API
Fixes: https://github.com/siderolabs/talos/issues/7017

Should allow external services to detect which user block devices might
need to be wiped during reset.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2023-03-31 16:10:59 +03:00
Andrey Smirnov
b8497b99eb
feat: update containerd to 1.6.20
See https://github.com/containerd/containerd/releases/tag/v1.6.20

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-03-31 16:14:43 +04:00
Andrey Smirnov
aa14993539
feat: introduce network probes
Network probes are configured with the specs, and provide their output
as a status.

At the moment only platform code can configure network probes.

If any network probes are configured, they affect network.Status
'Connectivity' flag.

Example, create the probe:

```
talosctl -n 172.20.0.3 meta write 0xa '{"probes": [{"interval": "1s", "tcp": {"endpoint": "google.com:80", "timeout": "10s"}}]}'
```

Watch probe status:

```
$ talosctl -n 172.20.0.3 get probe
NODE         NAMESPACE   TYPE          ID                  VERSION   SUCCESS
172.20.0.3   network     ProbeStatus   tcp:google.com:80   5         true
```

With failing probes:

```
$ talosctl -n 172.20.0.3 get probe
NODE         NAMESPACE   TYPE          ID                  VERSION   SUCCESS
172.20.0.3   network     ProbeStatus   tcp:google.com:80   4         true
172.20.0.3   network     ProbeStatus   tcp:google.com:81   1         false
$ talosctl -n 172.20.0.3 get networkstatus
NODE         NAMESPACE   TYPE            ID       VERSION   ADDRESS   CONNECTIVITY   HOSTNAME   ETC
172.20.0.3   network     NetworkStatus   status   5         true      true           true       true

```

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-03-31 15:20:21 +04:00
Noel Georgi
9dc1150e3a
docs: update nvidia instructions
Update NVIDIA install docs and add an example of setting `nvidia` as the
default runtimeclass.

NVIDIA doesn't have published images of vectoradd for CUDA 12, replacing
example with running `nvidia-smi` command.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-03-31 15:53:42 +05:30
Utku Ozdemir
7967ccfc13
feat: add config code entry screen to dashboard
Implement a screen for entering/managing the config `${code}` variable.

Enable this screen only when the platform is `metal` and there is a `${code}` variable in the `talos.config` kernel cmdline URL query.

Additionally, remove the "Delete" button and its functionality from the network config screen to avoid users accidentally deleting PlatformNetworkConfig parts that are not managed by the dashboard.

Add some tests for the form data parsing on the network config screen.
Remove the unnecessary lock on the summary tab - all updates come from the same goroutine.

Closes siderolabs/talos#6993.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2023-03-31 10:33:28 +02:00
Noel Georgi
ddb014cfdc
fix: udevd rules trigger
Restart udevd on adding custom rules where in the case the subsystems
needs to be re-triggered.

Fixes: #7001

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-03-31 01:59:07 +05:30
Nico Berlee
0af8fe2fb5
feat: netstat pod support
talosctl netstat -k show all host and non-hostnetwork pods sockets/connections.
talosctl netstat namespace/pod shows sockets/connections of a specific pod +
autocompletes in the shell.

Signed-off-by: Nico Berlee <nico.berlee@on2it.net>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-03-30 23:39:38 +04:00
Andrey Smirnov
52e857f55e
feat: linux 6.1.22, runc 1.1.5
Bump dependencies in preparation for Talos 1.4-beta.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-03-30 21:28:26 +04:00
Utku Ozdemir
aa662ff635
fix: apply small fixes on dashboard
* Clear the input form and switch to summary tab after the network config is saved.
* Use nodeaddress resource for detecting and displaying IPs. Improve the IP filtering logic.
* Fix the logic of gateway detection. Display all gateways instead of a single one.
* Use hostnamestatus resource to detect the hostname instead of an API call.
* Add hostname entry to the network info section on summary tab (as `HOST`).
* Enable `OUTBOUND` entry in network info section on summary tab.
* Display only the physical network interfaces in the interface dropdown on network config tab.
* Improve form input handling.
* Additional minor fixes & improvements.

Closes siderolabs/talos#6992.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2023-03-30 09:39:14 +02:00
Andrey Smirnov
188560a334
fix: add a link-scope route if the cmdline gateway is not reachable
Fixes #7020

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-03-29 22:25:04 +04:00
Dennis Marttinen
45c5b47a57
feat: dhcpv4: send current hostname, fix spec compliance of renewals
This adds support for automatically registering node hostnames in DNS by
sending the current hostname to DHCP via option 12. If the current hostname is
updated, issue a new DISCOVER to propagate the update to DHCP (updating the
hostname on lease renewals is not universally supported by DHCP servers). This
addition maintains the previous functionality where the node can also request
its hostname from the DHCP server. The received hostname will be processed and
prioritized as usual by the `network.HostnameSpecController`.

This change set also contains fixes to make DHCP renewals compliant with RFC
2131, specifically avoiding sending the server identifier and requested IP
address when issuing renewals using a previous offer. This also uncovered
issues and missing features in the upstream `insomniacslk/dhcp` library, the
fixes and improvements for which are now finally merged.

Sending hostname updates have been tested against `dnsmasq` and the built-in
DHCP + DNS services in Windows Server. Hostname retrieval from DHCP and edge
cases with overridden hostnames from different configuration layers have been
extensively tested against `dnsmasq`.

Signed-off-by: Dennis Marttinen <twelho@welho.tech>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-03-29 21:04:32 +04:00
Andrey Smirnov
289b41fe4b
fix: output of talosctl logs might be corruped
The line was not copied properly, so sometimes `talosctl logs` output
was garbled as the buffer was re-used.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-03-29 15:52:42 +04:00
Andrey Smirnov
02f0a4526d
feat: allow writing initial META values into the image
E.g. with the command:

```
make image-metal IMAGER_ARGS="--meta 0xc=abc --meta 0xd=abc"
```

This doesn't support ISO/PXE boot yet, it's going to come into the next
PR.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-03-28 21:27:00 +04:00
Andrey Smirnov
ea0e9bdbe4
feat: environment variables via the kernel arguments
Unify getting environment variables, support passing environment
variables via kernel args.

Fixes #6984
See #6999

For META this will be used to pass environment variables to the
installer for ISO images (or PXE booting).

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-03-28 16:28:33 +04:00
Andrey Smirnov
94c24ca64e
chore: add machine config version contract for v1.4
No changes vs. v1.3, so mostly no-op change just to keep things
consistent.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-03-27 18:09:21 +04:00
Andrey Smirnov
cefa9c3ecb
feat: update Kubernetes to 1.27.0-rc.0
See https://github.com/kubernetes/kubernetes/releases/tag/v1.27.0-rc.0

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-03-27 14:32:54 +04:00
Andrey Smirnov
9e8603f53b
feat: implement new download URL variable ${code}
New variable value is coming from `META`, and it might be set using the
interactive console (not implemented yet, but it will come soon).

I had to refactor the URL expansion implementation:

* simplify things where possible
* provide more unit-tests for smaller units
* handle expansion of all variables in parallel
* allow parallel expansion on multiple variables

Also I refactored download code to support proper passing of endpoint
function with context.

The end result:

* Talos will try to download config for 3 hours before rebooting
* Each attempt which includes URL expansion + download is limited to 3
  minutes

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-03-24 21:49:36 +04:00
Andrey Smirnov
d30cf9c86e
test: fix misprint in e2e scripts
This bug breaks `e2e-extensions`.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-03-24 15:28:18 +04:00
Andrey Smirnov
0d0bb31cf7
fix: use stripped kernel modules
Fixes #6931

See https://github.com/siderolabs/pkgs/pull/690

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-03-23 23:44:33 +04:00
Andrey Smirnov
3583eea983
release(v1.4.0-alpha.3): prepare release
This is the official v1.4.0-alpha.3 release.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-03-23 21:26:22 +04:00
Utku Ozdemir
a7b79ef1be
feat: add network config screen to dashboard
Implement the network config screen with input forms to configure the initial node networking by writing a config to the META partition.

Closes siderolabs/talos#6961.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2023-03-23 17:29:52 +04:00
Andrey Smirnov
cf2ccc521f
fix: always shutdown maintenance API service
The problem was that `GracefulStop()` will hang forever if there is a
running API call. So if there is a running streaming call, the
maintenance service might hang until it is finished.

The problem shows up with 'Upgrade' API in the maintenance mode if there
is a concurrent streaming API call, e.g.:

1. Watch API is running against maintenance mode.
2. Upgrade API is issued, it tries to run the MaintenanceUpgrade
   sequence, which tries to take over the Initialize sequence. The
   Initialize sequence is canceled, maintenance API service context is
   canceled, but the service doesn't terminate, as it's stuck in
   `GracefulStop`. The sequence take over times out, as even the
   sequence is canceled, it hasn't terminated yet.

Sample log:

```
[talos] upgrade request received: "ghcr.io/siderolabs/installer:v1.3.3"
[talos] upgrade failed: failed to acquire lock: timeout
[talos] task loadConfig (1/1): failed: failed to receive config via maintenance service: maintenance service failed: context canceled
[talos] phase config (6/7): failed
[talos] initialize sequence: failed
<stuck here>
```

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-03-23 16:59:30 +04:00
Andrey Smirnov
a0a5db590d
feat: update Flannel to 0.21.4
See https://github.com/flannel-io/flannel/releases/tag/v0.21.4

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-03-22 22:28:50 +04:00
Noel Georgi
d1a61fd343
chore: bump golangci-lint
Bump golangci-lint and fixup new warnings. Ignore check that checks for
used function parameters, it's kind of noisy and makes it confusing to
read interface implementations.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-03-22 19:55:38 +05:30
Noel Georgi
36a9a208ec
chore: bump deps
Bump deps

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-03-22 16:37:27 +05:30
Noel Georgi
c63cf90e32
feat: update k8s to v1.27.0-beta.0
Update k8s to v1.27.0-beta.0

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-03-21 23:59:17 +05:30
Dmitriy Matrenichev
b246c90abd
fix: add uint32 to Magic1 and Magic2
Discovered in #6971. Go compiler cannot deduce proper type on 32bit architectures for those constants,
in `fmt.Print(f)` functions. Since we only compare them with uint32 variables, it makes sense to add proper
types to them.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-03-21 09:57:55 -03:00
Andrey Smirnov
777c8d6f6e
chore: update COSI to watch aggregated version
This should fix problem with storm of update events causing buffer
overruns.

See also 66feeeccd91c8db560ae99a960cf4cc7c92594b9.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-03-21 15:59:17 +04:00
Andrey Smirnov
bec89bf6e5
fix: use 'no block' etcd dial with multiple endpoints
The problem showed up on 'reset' of the Talos node which had multiple
endpoints for other control plane nodes, many of which weren't actually
available.

When 'grpc.WithBlock()' is used, etcd will try to dial the first
endpoint and return an error if the dial fails.

Use noblock mode by default with multiple endpoints, and blocking mode
with a single endpoint.

Pass the context to etcd to properly abort dial operations if the
context get canceled.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-03-21 15:35:31 +04:00
Andrey Smirnov
28713c2c4d
feat: update Kubernetes to 1.26.3
Mostly to backport to 1.3.x, main should be soon updated to 1.27.x.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-03-20 23:36:11 +04:00
Dzerom Dzenkins
a3cf416475
docs: add InstallConfig ignored notice to doc
Mention that `.machine.install` gets ignored on pre-installed images.

Signed-off-by: Dzerom Dzenkins <dzeri96@proton.me>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-03-20 22:48:34 +04:00
Noel Georgi
df9b851fba
chore: load all external artifacts earlier
Load all external artifacts early in the build process so that the
binaries are available for e2e tests.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-03-20 12:29:24 +05:30
Utku Ozdemir
2dd0964c5f
refactor: use resource watches on dashboard
Instead of doing excessive get/list requests, do a watch per node in an infinite retry.

Additionally, refactor the dashboard code to make the various data listener namings more consistent and reorganize the packages.

Closes siderolabs/talos#6960.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2023-03-17 23:06:35 +01:00
Noel Georgi
9933ebb6aa
chore: fix loaded artifacts file permission
Azure skips the file permissions when upload/downloaded from the object
store. Make sure all binaries under `_out` have executable permissions.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-03-17 17:44:59 +05:30
Dmitriy Matrenichev
a14a0aba04
fix: nil pointer exception in syncLink
If link has no `Info` field we can't do anything meaningful, so we'll just log and skip.
Also fix race in test.

For #6956

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-03-17 15:33:20 +04:00
Noel Georgi
cf101e56fb
fix: add --force flag for talosctl gen
Error out if file(s) already exists and warn user to use
`--force` to overwrite.

Fixes: #6963

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-03-17 15:07:12 +05:30
Utku Ozdemir
ea2aa06116
fix: fix data race on network config read
Fix a data race caused by the metadata field of PlatformNetworkConfig being edited after it was sent to the channel. It caused test failures.

Fix it by setting a copy of the metadata instead.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2023-03-17 00:24:22 +01:00