2122 Commits

Author SHA1 Message Date
Andrey Smirnov
66f3ffdd4a
fix: ensure that Talos runs in a pod (container)
Drop the Kubernetes manifests as static files clean up (this is only
needed for upgrades from 1.2.x).

Fix Talos handling of cgroup hierarchy: if started in container in a
non-root cgroup hiearachy, use that to handle proper cgroup paths.

Add a test for a simple TinK mode (Talos-in-Kubernetes).

Update the docs.

Fixes #8274

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-02-20 15:06:48 +04:00
Noel Georgi
9dbc33972a
feat: add basic syslog implementation
Add a basic syslog listening on `/dev/log`.

Fixes: #8087

Signed-off-by: Noel Georgi <git@frezbo.dev>
2024-02-20 15:02:06 +05:30
Utku Ozdemir
0b7a27e6a1
feat: allow access to all resources over siderolink in maintenance mode
SideroLink is a secure channel, so we can allow read access to the resources. This will give us more control of the node via Omni and/or other systems using SideroLink.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2024-02-16 16:39:11 +01:00
Andrey Smirnov
7ee999f8a3
fix: disable KubeSpan endpoint harvesting by default
This disables by default (if not specified in the machine config) the
endpoint harvesting for KubeSpan peers.

The idea was to observe Wireguard endpoints as seen by other peers in
the cluster, and add them to the list of endpoints for the node. This
might be helpful only in case of some special type of NATs which are
almost never seen in the wild today.

So disable by default, but keep an option to enable it.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-02-16 18:18:33 +04:00
Utku Ozdemir
493bb60f81
fix: correctly handle partial configs in DNSUpstreamController
Prevent `DNSUpstreamController` from panicking by checking if the `machine` section in the config is `nil`. This is the case when a machine has partial configuration, e.g., when the machine has only a `SideroLinkConfig` in its config.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2024-02-16 10:31:54 +01:00
Andrey Smirnov
1366ce14a8
feat: update Kubernetes to v1.30.0-alpha.2
Talos Linux 1.7.0 will ship with Kubernetes v1.30.0.

Drop some compatibility for Kubernetes < 1.25, as 1.25 is the minimum
supported version now.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-02-15 21:56:56 +04:00
Noel Georgi
15e8bca2b2
feat: support environment in ExtensionServicesConfig
Support setting extension services environment variables in
`ExtensionServiceConfig` document.

Refactor `ExtensionServicesConfig` -> `ExtensionServiceConfig` and move extensions config under `runtime` pkg.

Fixes: #8271

Signed-off-by: Noel Georgi <git@frezbo.dev>
2024-02-15 20:16:29 +05:30
Matthieu S
3fe82ec461
feat: custom image settings for k8s upgrade
Allows to use custom registry/images.

Fixes: #8275

Co-authored-by:  @g3offrey
Signed-off-by: Matthieu STROHL <mstrohl@dive-in-it.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-02-15 17:54:01 +04:00
Dmitriy Matrenichev
fa3b933705
chore: replace fmt.Errorf with errors.New where possible
This time use `eg` from `x/tools` repo tool to do this.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2024-02-14 17:39:30 +03:00
Noel Georgi
2f0421b406
fix: run xfs_repair on invalid argument error
Run `xfs_repair` for invalid argument error.

Part of: #8292

Signed-off-by: Noel Georgi <git@frezbo.dev>
2024-02-13 23:01:33 +05:30
Dmitriy Matrenichev
fa2d34dd88
chore: enable v6 support on the same port
Replace `SO_REUSEPORT` with `SO_REUSEPORT`.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2024-02-13 01:02:27 +03:00
Dmitriy Matrenichev
83e0b0c19a
chore: adjust dns sockets settings
Enable some TCP optimization, set minimal TTL, set socket reuse.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2024-02-12 17:13:03 +03:00
Dmitriy Matrenichev
5324d39167
chore: bump stuff
Also fix .golangci.yml file.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2024-02-09 19:19:25 +03:00
Dmitriy Matrenichev
afa71d6b02
chore: use "handle-like" resource in DNSResolveCacheController
Rework (and simplify) `DNSResolveCacheController` to use `DNSUpstream` "handle-like" resources.

Depends on https://github.com/cosi-project/runtime/pull/400

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2024-02-08 21:40:57 +03:00
Andrey Smirnov
3f8a85f1b3
fix: unlock the upgrade mutex properly
Fixes #4525

The previous implementation had several issues:

* etcd concurrency session never closed
* Unlock() with potentially closed context
* unlocking when upgrade sequence finishes, but this overlaps with the
  machine reboot, so a chance that it never got unlocked

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-02-08 15:50:02 +04:00
Noel Georgi
1e6c8c4dec
feat: extensions services config
Support config files for extension services.

Fixes: #7791

Co-authored-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Signed-off-by: Noel Georgi <git@frezbo.dev>
2024-02-06 17:12:01 +05:30
shurkys
989ca3ade1
feat: add OpenNebula platform support
Initial support without documentation.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Signed-off-by: shurkys <no@mail.com>
2024-02-05 20:43:47 +04:00
Henno Schooljan
a04cc80154
fix: pass TTL when generating client certificate
Pass the TTL to the talosconfig generation function.

Signed-off-by: Henno Schooljan <github@sfynx.nl>
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-02-05 18:54:16 +04:00
Dmitriy Matrenichev
3fe8c12ca6
fix: add log line about controller runtime failing
While we decide what to do with #8263 and #8256 this quickfix at least allows us to
see what went wrong

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2024-02-05 17:22:02 +03:00
Andrey Smirnov
ddbabc7e58
fix: use a separate cgroup for each extension service
Fixes #8229

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-02-05 17:37:55 +04:00
Saiyam Pathak
4184e617ab
chore: add test for wasmedge runtime extension
Add tests for WasmEdge container runtime system extension.

Signed-off-by: Saiyam Pathak <saiyam911@gmail.com>
Signed-off-by: Noel Georgi <git@frezbo.dev>
2024-02-05 18:18:13 +05:30
Andrey Smirnov
95ea3a6c65
chore: bump timeout in acquire tests
With switching to RSA service account, machine config generation time is
considerably higher now, so the test might not make it in time.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-02-05 15:18:22 +04:00
Andrey Smirnov
2ff81c06bc
feat: update runc 1.1.12, containerd 1.7.13
Also:

* Linux 6.6.14 + XDP enablement
* etcd 3.5.12

Various other bumps for the tools, utilities, and Go modules.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-02-01 17:01:04 +04:00
Andrey Smirnov
9d8cd4d058
chore: drop deprecated method EtcdRemoveMember
It was deprecated 16 months ago, time to cleanup.

(This is to prepare for the first v1.7 release)

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-02-01 15:54:29 +04:00
Andrey Smirnov
17567f19be
fix: take into account the moment seen when cleaning up CRI images
Fixes #8069

The image age from the CRI is the moment the image was pulled, so if it
was pulled long time ago, the previous version would nuke the image as
soon as it is unreferenced. The new version would allow the image to
stay for the full grace period in case the rollback is requested.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-02-01 14:44:22 +04:00
Andrey Smirnov
593afeea38
fix: run the interactive installer loop to report errors
In the previous implementation, even though `installer.err` was set, it
was never checked 🤦.

The run loop was stolen from the dashboard code.

Fixes #8205

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-01-31 19:20:46 +04:00
Andrey Smirnov
87be76b878
fix: be more tolerant to error handling in Mounts API
Fixes #8202

If some mountpoint can't be queried successfully for 'diskfree'
information, don't treat that as an error, and report zero values for
disk usage/size instead.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-01-31 18:24:38 +04:00
Dmitriy Matrenichev
ebeef28525
feat: implement local caching dns server
This PR adds a new controller - `DNSServerController` that starts tcp and udp dns servers locally. Just like `EtcFileController` it monitors `ResolverStatusType` and updates the list of destinations from there.

Most of the caching logic is in our "lobotomized" "`CoreDNS` fork. We need this fork because default `CoreDNS` carries
full Caddy server and various other modules that we don't need in Talos. On our side we implement
random selection of the actual dns and request forwarding.

Closes #7693

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2024-01-29 20:26:38 +03:00
Andrey Smirnov
b44551ccdb
feat: update Linux to 6.6.13
See https://github.com/siderolabs/pkgs/pull/873

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-01-29 16:50:33 +04:00
Andrey Smirnov
d677901b67
feat: implement device selector for 'physical'
Closes #8090

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-01-23 15:05:51 +04:00
Andrey Smirnov
c1e45071f0
refactor: use etcd configuration from the EtcdSpec resource
This is currently no-op, just noticed that while looking into another
bug. This should make the intention more clean.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-01-22 16:06:16 +04:00
Andrey Smirnov
474eccdc4c
fix: watch bufer overrun for RouteStatus
Fixes #8157

This PR contains two fixes, both related to the same problem.

Several routes for different links but  same IPv6 destination might exist
at the same time, so route resource ID should handle that. The problem
was that these routes were mis-reported causing internally updates for
the same resources multiple times (equal to the number of the links).

Don't trigger controllers more often than 10 times/seconds (with burst of
5) for kernel notifications. This ensures Talos doesn't try to reflect
current state of the network subsystem too often as resources, which
causes excessive CPU usage and might potentially lead to the buffer
overrun under high rate of changes.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-01-17 19:28:25 +04:00
Andrey Smirnov
9782319c31
fix: support KubePrism settings in Kubernetes Discovery
Fixes #8143

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-01-16 20:41:13 +04:00
Dmitriy Matrenichev
f70b47dddc
fix: force KubePrism to connect using IPv4
Before this change KubePrism used hardcoded "localhost" as destination which Go could resolve to IPv6 destination and
then fail to connect to. This change forces KubePrism to connect using IPv4 and uses hardcoded "127.0.0.1" destination so
it will always use IPv4.

For #8112

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2024-01-15 21:25:05 +03:00
Utku Ozdemir
7fa7362ddc
fix: fix nodes on dashboard footer when node names are used in --nodes
When the dashboard is used via the CLI through a proxy, e.g., through Omni, node names or IDs can be used in the `--nodes` flag instead of the IPs.

This caused rendering inconsistencies in the dashboard, as some parts of it used the IPs and some used the names passed in the context.

Fix this by collecting all node IPs on dashboard start, and map these IPs to the respective nodes passed as the `--nodes` flag.

On the dashboard footer, we always display the node names as they are passed in the `--nodes` flag.

As part of it, remove the node list change reactivity from the dashboard, so it will always take the passed nodes as the truth.

The IP to node mapping collection at dashboard startup also solves another issue where the first API call by the dashboard triggered the interactive API authentication (e.g., the OIDC flow). Previously, because the terminal was already switched to the raw mode, it was not possible to authenticate properly.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2024-01-12 12:00:08 +01:00
Jonomir
dea9bda2d0
fix: disk UUID & WWID always empty in talosctl disks
Add missing attributes to conversion of go-blockdevice disk
to protobuf disk.

Signed-off-by: Jonomir <68125495+Jonomir@users.noreply.github.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2024-01-11 14:37:39 +04:00
Serge Logvinov
f6926faab5
fix: default priority for ipv6
We will use the default IPv6 gateway priority as 2048.
The RA default is 1024, which leads to verbose messages such as 'error adding route: netlink receive: file exists.'

Azure uses DHCPv6 and RA for configuring IPv6 on the node.
The platform sets the default gateway as a fallback in case 'accept_ra' is not set to 2.

Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-12-29 18:42:23 +04:00
Andrey Smirnov
0a30ef7845
fix: imager should support different Talos versions
Add some quirks to make images generated with newer Talos compatible
with images generated by older Talos.

Specifically, reset options were adding in Talos 1.4, so we shouldn't
add them for older versions.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-12-22 16:13:34 +04:00
Andrey Smirnov
e6e422b92a
chore: bump dependencies
Go modules, tools, etc.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-12-21 19:01:16 +04:00
Dmitriy Matrenichev
59b62398f6
chore: modernize machined/pkg/controllers/k8s
This is going to be multipart effort to finally use safe.* wrappers in the production code.
Quick regexp search shows that there are around 150 direct type assertions on resources (excluding the ones in this commit).

Also - migrate from `interface{}` to `any` and use `slices.Sort*` instead of `sort.*` where possible.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-12-15 19:33:06 +03:00
Andrey Smirnov
760f793d55
fix: use correct prefix when installing SBC files
When creating an image under non-default mount prefix, it should be
used explicitly when copying SBC files.

See https://github.com/siderolabs/image-factory/issues/65

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-12-15 19:46:10 +04:00
Noel Georgi
0b94550c42
chore: fix the gvisor test
The gvisor test was not using the correct runtimeclass and would have
always passed the regardless.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-12-15 20:48:44 +05:30
Andrey Smirnov
d803e40ef2
docs: provide documentation for Talos 1.6
Updated lots of documentation with new/updated flows.

Provide What's New for Talos 1.6.0.

Update Troubleshooting guide to cover more steps.

Make Talos 1.6 docs the default.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-12-15 16:36:57 +04:00
Andrey Smirnov
10c59a6b90
fix: leave discovery service later in the reset sequence
Fixes #8057

I went back and forth on the way to fix it exactly, and ended up with a
pretty simple version of a fix.

The problem was that discovery service was removing the member at the
initial phase of reset, which actually still requires KubeSpan to be up:

* leaving `etcd` (need to talk to other members)
* stopping pods (might need to talk to Kubernetes API with some CNIs)

Now leaving discovery service happens way later, when network
interactions are no longer required.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-12-13 19:16:12 +04:00
Andrey Smirnov
131a1b1671
fix: add a KubeSpan option to disable extra endpoint harvesting
It works well for small clusters, but with bigger clusters it puts too
much load on the discovery service, as it has quadratic complexity in
number of endpoints discovered/reported from each member.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-12-12 14:07:31 +04:00
Artem Chernyshev
4547ad9afa
feat: send actor id to the SideroLink events sink
This might come handy to distinguish sequences, tasks initiated by a
particular API request.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2023-12-11 21:59:02 +03:00
Dmitriy Matrenichev
6bb1e99aa3
chore: optimize pcap dump
Reimplement `gopacket.PacketSource.PacketsCtx` as `forEachPacket`.

- Use `ZeroCopyPacketDataSource` instead of `PacketDataSource`. I didn't find any specific reason why `PacketDataSource` exists at all, since `NewPacket` is doing copy inside if you don't explicitly tell it not to.
- Use `WillPool` to pool packet buffers. It doesn't fully remove allocations, but it's a safe start.
  Send packets back into the pool after we are done with them.
- Pass `Packet` directly to the closure instead of waiting for it on the channel. We don't store this packet anywhere so there is no reason to async this part.
- Drop `time.Sleep` code in `forEachPacket` body.
- Drop `SnapLen` support in client and server since it didn't work anyway (details in the PR).

Closes #7994

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-12-11 15:44:42 +03:00
Andrey Smirnov
46121c9fec
docs: rework machine config documentation generation
Generate a structured table of contents following the structure of the
config.

Make high-level examples follow the full structure of the config.

Document new multi-doc machine config.

Fixes #8023

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-12-08 14:16:40 +04:00
Andrey Smirnov
270604bead
fix: support user disks via symlinks
The core blockdevice library already supported resolving symlinks, we
just need to get the raw block device name from it, and use it
afterwards.

In QEMU provisioner, leave the first (system) disk as virtio (for
performance), and mount user disks as 'ata', which allows `udevd` to
pick up the disk IDs (not available for `virtio`), and use the symlink
path in the tests.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-12-05 22:02:56 +04:00
Andrey Smirnov
474fa0480d
fix: store and execute desired action on emergency action
Fixes #7854

Talos runs an emergency handler if the sequence experience and
unrecoverable failure. The emergency handler was unconditionally
executing "reboot" action if no other action was received (which only
gets received if the sequence completes successfully), so the Shutdown
request might result in a Reboot behavior on error during shutdown
phase.

This is not a pretty fix, but it's hard to deliver the intent from one
part of the code to another right now, so instead use a global variable
which stores default emergency intention, and gets overridden early in
the Shutdown sequence.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
2023-12-04 19:51:48 +04:00