1955 Commits

Author SHA1 Message Date
Henno Schooljan
a04cc80154
fix: pass TTL when generating client certificate
Pass the TTL to the talosconfig generation function.

Signed-off-by: Henno Schooljan <github@sfynx.nl>
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-02-05 18:54:16 +04:00
Dmitriy Matrenichev
3fe8c12ca6
fix: add log line about controller runtime failing
While we decide what to do with #8263 and #8256 this quickfix at least allows us to
see what went wrong

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2024-02-05 17:22:02 +03:00
Andrey Smirnov
ddbabc7e58
fix: use a separate cgroup for each extension service
Fixes #8229

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-02-05 17:37:55 +04:00
Saiyam Pathak
4184e617ab
chore: add test for wasmedge runtime extension
Add tests for WasmEdge container runtime system extension.

Signed-off-by: Saiyam Pathak <saiyam911@gmail.com>
Signed-off-by: Noel Georgi <git@frezbo.dev>
2024-02-05 18:18:13 +05:30
Andrey Smirnov
95ea3a6c65
chore: bump timeout in acquire tests
With switching to RSA service account, machine config generation time is
considerably higher now, so the test might not make it in time.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-02-05 15:18:22 +04:00
Andrey Smirnov
2ff81c06bc
feat: update runc 1.1.12, containerd 1.7.13
Also:

* Linux 6.6.14 + XDP enablement
* etcd 3.5.12

Various other bumps for the tools, utilities, and Go modules.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-02-01 17:01:04 +04:00
Andrey Smirnov
9d8cd4d058
chore: drop deprecated method EtcdRemoveMember
It was deprecated 16 months ago, time to clean up.

(This is to prepare for the first v1.7 release)

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-02-01 15:54:29 +04:00
Andrey Smirnov
17567f19be
fix: take into account the moment seen when cleaning up CRI images
Fixes #8069

The image age from the CRI is the moment the image was pulled, so if it
was pulled a long time ago, the previous version would nuke the image as
soon as it is unreferenced. The new version would allow the image to
stay for the full grace period in case the rollback is requested.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-02-01 14:44:22 +04:00
Andrey Smirnov
593afeea38
fix: run the interactive installer loop to report errors
In the previous implementation, even though `installer.err` was set, it
was never checked 🤦.

The run loop was stolen from the dashboard code.

Fixes #8205

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-01-31 19:20:46 +04:00
Andrey Smirnov
87be76b878
fix: be more tolerant to error handling in Mounts API
Fixes #8202

If some mountpoint can't be queried successfully for 'diskfree'
information, don't treat that as an error, and report zero values for
disk usage/size instead.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-01-31 18:24:38 +04:00
Dmitriy Matrenichev
ebeef28525
feat: implement local caching dns server
This PR adds a new controller - `DNSServerController` that starts tcp and udp dns servers locally. Just like `EtcFileController` it monitors `ResolverStatusType` and updates the list of destinations from there.

Most of the caching logic is in our "lobotomized" `CoreDNS` fork. We need this fork because default `CoreDNS` carries
full Caddy server and various other modules that we don't need in Talos. On our side we implement
random selection of the actual dns and request forwarding.

Closes #7693

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2024-01-29 20:26:38 +03:00
Andrey Smirnov
b44551ccdb
feat: update Linux to 6.6.13
See https://github.com/siderolabs/pkgs/pull/873

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-01-29 16:50:33 +04:00
Andrey Smirnov
d677901b67
feat: implement device selector for 'physical'
Closes #8090

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-01-23 15:05:51 +04:00
Andrey Smirnov
c1e45071f0
refactor: use etcd configuration from the EtcdSpec resource
This is currently no-op, just noticed that while looking into another
bug. This should make the intention clearer.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-01-22 16:06:16 +04:00
Andrey Smirnov
474eccdc4c
fix: watch buffer overrun for RouteStatus
Fixes #8157

This PR contains two fixes, both related to the same problem.

Several routes for different links but the same IPv6 destination might exist
at the same time, so route resource ID should handle that. The problem
was that these routes were misreported, internally causing updates for
the same resources multiple times (equal to the number of the links).

Don't trigger controllers more often than 10 times/second (with burst of
5) for kernel notifications. This ensures Talos doesn't try to reflect
current state of the network subsystem too often as resources, which
causes excessive CPU usage and might potentially lead to the buffer
overrun under high rate of changes.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-01-17 19:28:25 +04:00
Andrey Smirnov
9782319c31
fix: support KubePrism settings in Kubernetes Discovery
Fixes #8143

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-01-16 20:41:13 +04:00
Dmitriy Matrenichev
f70b47dddc
fix: force KubePrism to connect using IPv4
Before this change KubePrism used hardcoded "localhost" as destination which Go could resolve to IPv6 destination and
then fail to connect to. This change forces KubePrism to connect using IPv4 and uses hardcoded "127.0.0.1" destination so
it will always use IPv4.
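
As an illustration of the difference, the sketch below builds a literal IPv4 loopback address and dials it over the `tcp4` network, so no name resolution is involved. The helper names and the port value are assumptions made for the example, not the KubePrism code.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"strconv"
	"time"
)

// localEndpoint builds a literal IPv4 loopback address, so "localhost"
// is never resolved (and can never resolve to ::1).
func localEndpoint(port int) string {
	return net.JoinHostPort("127.0.0.1", strconv.Itoa(port))
}

// dialIPv4 restricts the dialer to IPv4 by using the "tcp4" network.
func dialIPv4(ctx context.Context, addr string) (net.Conn, error) {
	d := net.Dialer{Timeout: 2 * time.Second}

	return d.DialContext(ctx, "tcp4", addr)
}

func main() {
	addr := localEndpoint(7445) // example port, chosen for illustration

	conn, err := dialIPv4(context.Background(), addr)
	if err != nil {
		fmt.Println("dial failed:", err)

		return
	}
	defer conn.Close()

	fmt.Println("connected to", conn.RemoteAddr())
}
```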

For #8112

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2024-01-15 21:25:05 +03:00
Utku Ozdemir
7fa7362ddc
fix: fix nodes on dashboard footer when node names are used in --nodes
When the dashboard is used via the CLI through a proxy, e.g., through Omni, node names or IDs can be used in the `--nodes` flag instead of the IPs.

This caused rendering inconsistencies in the dashboard, as some parts of it used the IPs and some used the names passed in the context.

Fix this by collecting all node IPs on dashboard start, and map these IPs to the respective nodes passed as the `--nodes` flag.

On the dashboard footer, we always display the node names as they are passed in the `--nodes` flag.

As part of it, remove the node list change reactivity from the dashboard, so it will always take the passed nodes as the truth.

The IP to node mapping collection at dashboard startup also solves another issue where the first API call by the dashboard triggered the interactive API authentication (e.g., the OIDC flow). Previously, because the terminal was already switched to the raw mode, it was not possible to authenticate properly.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2024-01-12 12:00:08 +01:00
Jonomir
dea9bda2d0
fix: disk UUID & WWID always empty in talosctl disks
Add missing attributes to conversion of go-blockdevice disk
to protobuf disk.

Signed-off-by: Jonomir <68125495+Jonomir@users.noreply.github.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-01-11 14:37:39 +04:00
Serge Logvinov
f6926faab5
fix: default priority for ipv6
We will use the default IPv6 gateway priority as 2048.
The RA default is 1024, which leads to verbose messages such as 'error adding route: netlink receive: file exists.'

Azure uses DHCPv6 and RA for configuring IPv6 on the node.
The platform sets the default gateway as a fallback in case 'accept_ra' is not set to 2.

Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-12-29 18:42:23 +04:00
Andrey Smirnov
0a30ef7845
fix: imager should support different Talos versions
Add some quirks to make images generated with newer Talos compatible
with images generated by older Talos.

Specifically, reset options were added in Talos 1.4, so we shouldn't
add them for older versions.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-12-22 16:13:34 +04:00
Andrey Smirnov
e6e422b92a
chore: bump dependencies
Go modules, tools, etc.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-12-21 19:01:16 +04:00
Dmitriy Matrenichev
59b62398f6
chore: modernize machined/pkg/controllers/k8s
This is going to be multipart effort to finally use safe.* wrappers in the production code.
Quick regexp search shows that there are around 150 direct type assertions on resources (excluding the ones in this commit).

Also - migrate from `interface{}` to `any` and use `slices.Sort*` instead of `sort.*` where possible.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-12-15 19:33:06 +03:00
Andrey Smirnov
760f793d55
fix: use correct prefix when installing SBC files
When creating an image under non-default mount prefix, it should be
used explicitly when copying SBC files.

See https://github.com/siderolabs/image-factory/issues/65

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-12-15 19:46:10 +04:00
Noel Georgi
0b94550c42
chore: fix the gvisor test
The gvisor test was not using the correct runtimeclass and would have
always passed regardless.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-12-15 20:48:44 +05:30
Andrey Smirnov
d803e40ef2
docs: provide documentation for Talos 1.6
Updated lots of documentation with new/updated flows.

Provide What's New for Talos 1.6.0.

Update Troubleshooting guide to cover more steps.

Make Talos 1.6 docs the default.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-12-15 16:36:57 +04:00
Andrey Smirnov
10c59a6b90
fix: leave discovery service later in the reset sequence
Fixes #8057

I went back and forth on the way to fix it exactly, and ended up with a
pretty simple version of a fix.

The problem was that discovery service was removing the member at the
initial phase of reset, which actually still requires KubeSpan to be up:

* leaving `etcd` (need to talk to other members)
* stopping pods (might need to talk to Kubernetes API with some CNIs)

Now leaving discovery service happens way later, when network
interactions are no longer required.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-12-13 19:16:12 +04:00
Andrey Smirnov
131a1b1671
fix: add a KubeSpan option to disable extra endpoint harvesting
It works well for small clusters, but with bigger clusters it puts too
much load on the discovery service, as it has quadratic complexity in
number of endpoints discovered/reported from each member.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-12-12 14:07:31 +04:00
Artem Chernyshev
4547ad9afa
feat: send actor id to the SideroLink events sink
This might come in handy to distinguish sequences, tasks initiated by a
particular API request.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2023-12-11 21:59:02 +03:00
Dmitriy Matrenichev
6bb1e99aa3
chore: optimize pcap dump
Reimplement `gopacket.PacketSource.PacketsCtx` as `forEachPacket`.

- Use `ZeroCopyPacketDataSource` instead of `PacketDataSource`. I didn't find any specific reason why `PacketDataSource` exists at all, since `NewPacket` is doing copy inside if you don't explicitly tell it not to.
- Use `WillPool` to pool packet buffers. It doesn't fully remove allocations, but it's a safe start.
  Send packets back into the pool after we are done with them.
- Pass `Packet` directly to the closure instead of waiting for it on the channel. We don't store this packet anywhere so there is no reason to async this part.
- Drop `time.Sleep` code in `forEachPacket` body.
- Drop `SnapLen` support in client and server since it didn't work anyway (details in the PR).

Closes #7994

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-12-11 15:44:42 +03:00
Andrey Smirnov
46121c9fec
docs: rework machine config documentation generation
Generate a structured table of contents following the structure of the
config.

Make high-level examples follow the full structure of the config.

Document new multi-doc machine config.

Fixes #8023

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-12-08 14:16:40 +04:00
Andrey Smirnov
270604bead
fix: support user disks via symlinks
The core blockdevice library already supported resolving symlinks, we
just need to get the raw block device name from it, and use it
afterwards.

In QEMU provisioner, leave the first (system) disk as virtio (for
performance), and mount user disks as 'ata', which allows `udevd` to
pick up the disk IDs (not available for `virtio`), and use the symlink
path in the tests.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-12-05 22:02:56 +04:00
Andrey Smirnov
474fa0480d
fix: store and execute desired action on emergency action
Fixes #7854

Talos runs an emergency handler if the sequence experiences an
unrecoverable failure. The emergency handler was unconditionally
executing "reboot" action if no other action was received (which only
gets received if the sequence completes successfully), so the Shutdown
request might result in a Reboot behavior on error during shutdown
phase.

This is not a pretty fix, but it's hard to deliver the intent from one
part of the code to another right now, so instead use a global variable
which stores default emergency intention, and gets overridden early in
the Shutdown sequence.
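
The approach can be sketched as a package-level value that defaults to reboot and is overridden when a shutdown begins. All names here are hypothetical, and the atomic type (Go 1.19+) is a choice made for this example rather than a claim about the actual implementation.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type action int32

const (
	actionReboot action = iota // zero value, so reboot is the default
	actionShutdown
)

func (a action) String() string {
	if a == actionShutdown {
		return "shutdown"
	}

	return "reboot"
}

// defaultEmergencyAction stores what the emergency handler should do
// when no other action was delivered.
var defaultEmergencyAction atomic.Int32

func setEmergencyAction(a action) { defaultEmergencyAction.Store(int32(a)) }

func emergencyAction() action { return action(defaultEmergencyAction.Load()) }

func main() {
	fmt.Println(emergencyAction()) // reboot

	// Early in the shutdown sequence the intent is recorded, so a failure
	// later in that sequence powers the machine off instead of rebooting.
	setEmergencyAction(actionShutdown)
	fmt.Println(emergencyAction()) // shutdown
}
```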

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-12-04 19:51:48 +04:00
Andrey Smirnov
dbf274ddf7
fix: skip writing the file if the contents haven't changed
As the controller reconciles every /etc file present, it might be called
multiple times for the same file, even if the actual contents haven't
changed.

Rewriting the file might lead to some concurrent process seeing
incomplete file contents more often than needed.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-12-04 15:58:03 +04:00
Andrey Smirnov
d8a435f0e4
fix: initialize boot assets with defaults early
The problem was that bootloaders were correctly picking up defaults for
`installer` mode (vs. `imager` mode), but DTB and other SBC stuff wasn't
properly initialized, so installing on SBC fails.

Now all options are properly initialized with defaults early in the
process.

Fixes #8009

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-12-01 17:47:05 +04:00
Andrey Smirnov
c6835de17a
fix: pick etcd advertised addresses from 'current' addresses
Fixes #7947

This way etcd advertised address can be picked from the `external IPs`
of the machine.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-12-01 17:26:28 +04:00
Andrey Smirnov
e71e3e4161
feat: support extra arguments for flanneld
Fixes #7754

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-12-01 16:18:02 +04:00
Andrey Smirnov
36c8ddb5e1
feat: implement ingress firewall rules
Fixes #4421

See documentation for details on how to use the feature.

With `talosctl cluster create`, firewall can be easily tested with
`--with-firewall=accept|block` (default mode).

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-11-30 22:58:16 +04:00
Dmitriy Matrenichev
0b111ecb81
fix: support slices of enums and fix NfTablesConntrackStateMatch
We already have the code which supports custom enums, so let's extend it to support custom enums in slices and
fix the NfTablesConntrackStateMatch proto definition.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-11-30 00:23:16 +03:00
Andrey Smirnov
9a85217412
feat: improve nftables backend
Many changes to the nftables backend which will be used in the follow-up
PR with #4421.

1. Add support for chain policy: drop/accept.
2. Properly handle match on all IPs in the set (`0.0.0.0/0` like).
3. Implement conntrack state matching.
4. Implement multiple ifname matching in a single rule.
5. Implement anonymous counters.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-11-29 21:22:47 +04:00
Noel Georgi
f041b26299
chore: add tests for mdadm extension
Add tests for mdadm extension.

See: https://github.com/siderolabs/extensions/pull/271

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-11-27 23:18:35 +05:30
Andrey Smirnov
e46e6a312f
feat: implement nftables backend
Implement initial set of backend controllers/resources to handle
nftables chains/rules etc.

Replace the KubeSpan nftables operations with controller-based.

See #4421

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-11-27 21:14:15 +04:00
Dmitriy Matrenichev
ba827bf8b8
chore: support getting multiple endpoints from the Provision rpc call
The code will rotate through the endpoints, until it reaches the end, and only then it will try to do the provisioning again.

Closes #7973

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-11-25 21:38:44 +03:00
Dmitriy Matrenichev
dd45dd06cf
chore: add custom node taints
This PR adds support for custom node taints. Refer to `nodeTaints` in the `configuration` for more information.

Closes #7581

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-11-25 18:33:18 +03:00
Dmitriy Matrenichev
70d53ee13c
chore: deprecate .persist and .extensions
This commit deprecates the following:
- Removes the support of `.persist` flag. From now, it should always be enabled or not defined in the config.
- Removes the documentation for `.bootloader`. It never worked anyway.
- Adds a warning for `.machine.install.extensions`, suggests to use boot-assets.

Closes #7972
Closes #7507

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-11-22 20:35:38 +03:00
Noel Georgi
aca8b5e179
fix: ignore kernel command line in container mode
Ignore kernel command line for `SideroLink` and `EventsSink` config when
running in container mode. Otherwise when running Talos as a docker
container in Talos it picks up the host kernel cmdline and tries to
configure SideroLink/EventsSink.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-11-21 18:55:37 +05:30
Andrey Smirnov
27d208c26b
feat: implement OAuth2 device flow for machine config
Fixes #7939

See documentation in the PR for the description of the feature.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-11-20 14:31:43 +04:00
Noel Georgi
5c8fa2a803
chore: start containerd early in boot
Start containerd early in the boot process so system extension services
start in maintenance mode.

Fixes: #7083

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-11-16 23:19:33 +05:30
Noel Georgi
0d3c3ed716
feat: support kube scheduler config
Support kube-scheduler config.

Fixes: #7905
Partially fixes: #7911

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-11-15 10:15:23 +05:30
Andrey Smirnov
06941b7e5c
fix: allow rootfs propagation configuration for extension services
Fixes #7873

Some services perform mounts inside the container that must propagate
back to the host (e.g. `stargz-snapshotter`), and they require this
configuration setting.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-11-13 21:58:22 +04:00