1379 Commits

Author SHA1 Message Date
Andrey Smirnov
883d401f9f
chore: rename github organization to siderolabs
Go module import paths still use talos-systems, packages use new
siderolabs name.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-03-23 21:07:46 +03:00
Andrey Smirnov
f477507262
fix: the etcd recovery client and tests
This is the follow-up fix to the PR #5129.

1. Correctly catch only expected errors in the tests.
2. Rewind the snapshot each time the upload is retried.
3. Correctly unwrap errors in the `EtcdRecovery` client.
4. Update the `grpc-proxy` library to pass through the EOF error.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-03-22 16:51:36 +03:00
Andrey Smirnov
69e07cddc7
fix: trigger properly udevd on types and actions
Default `--type` is `devices`, so trigger explicitly on both `devices`
and `subsystems` and use `add` action to mock initial events better.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-03-21 21:58:32 +03:00
Andrey Smirnov
47d0e629d4
fix: clean up custom udev rules if the config is cleared
This fixes a case when udev rules are first created in the machine
config and then removed from the config.

As the file is on the overlayfs, it persists over reboots, so we need to
write it every time we boot Talos.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-03-21 20:38:27 +03:00
Andrey Smirnov
b6691b3508
chore: bump dependencies
dependabot + go-mod-outdated

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-03-21 18:24:00 +03:00
Artem Chernyshev
27af5d41c6
feat: pause the boot process on some failures instead of rebooting
Some failures can be fixed by updating the machine configuration.
Failures in `userDisks` and `userFiles` no longer send Talos into a reboot
loop; the boot pauses for 35 minutes instead.

Additionally, `apid` and `machined` are now started right after
containerd is up and running.

That makes it possible for the operator to connect to the node using
talosctl and fix the config.

Fixes: https://github.com/talos-systems/talos/issues/4669
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2022-03-21 17:39:45 +03:00
Andrey Smirnov
58cb9db1e2
feat: allow hardlinks in the system extension images
They should cause no harm, as every extension is an image on its own, so
hardlinks are only possible between files within a single image.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-03-21 15:38:34 +03:00
Andrey Smirnov
1e982808fb
fix: ignore pod CIDRs for kubelet node IPs
I'm not sure how I haven't noticed that before, but that is easily
reproducible with virtual IP moving between the nodes: Talos incorrectly
assumes that pod IPs might be valid kubelet node IPs, and this might
lead to unexpected results if the kubelet node IP is picked to be equal
to pod CIDR.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-03-21 14:52:40 +03:00
Andrey Smirnov
c156580a38
fix: split regular network operation configuration and virtual IP
The problem is that Virtual IP operator configuration might require
accessing platform metadata server (e.g. on Equinix Metal), while
regular operator sets up critical operators like DHCP.

The issue observed on Equinix Metal without the split:

* on initial boot, DHCP is set up on `eth2`
* platform network configuration is fetched and `bond0` configuration is
created
* node IP is assigned both to `eth2` and `bond0`, while `eth2` is a
slave to `bond0`
* networking is broken
* operator config controller is stuck trying to fetch EM VIP
configuration, as the network is broken, it fails to do so, but retries
for 3 minutes (in `download.Download`)
* network is broken for 3 minutes until `OperatorConfig` controller is
unblocked and cleans up DHCP operator for `eth2` as it should

The issue here is that DHCP operator setup is much more tricky on one
hand (depends on link status, other configuration items, etc.), while
VIP operator depends on DHCP operator setup, as it needs outbound
networking.

By splitting the controllers, we split the flows and remove
dependencies.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-03-18 20:24:47 +03:00
Andrey Smirnov
cd4d4c6054
feat: relax extensions file structure validation
* allow empty directories (I see no harm in having them)
* allow symlinks

See also https://github.com/talos-systems/extensions/pull/20

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-03-17 21:21:42 +03:00
Andrey Smirnov
327ce5aba3
fix: invert the condition to skip kubelet kernel checks
We should skip the checks on container platforms, as Talos has no way to
enforce conditions on the host kernel.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-03-16 21:43:51 +03:00
Andrey Smirnov
caf800fe84
feat: implement D-Bus systemd-compatible shutdown for kubelet
Add a mock D-Bus daemon and a mock logind implementation over D-Bus.

Kubelet gets a handle to the D-Bus socket, connects over it to our
logind mock and negotiates shutdown activities.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-03-16 13:57:46 +03:00
Andrey Smirnov
355b1a4bed
fix: refresh etcd certs on startup/join
This is more of a bandaid than a real fix. As this should be
backported to `release-1.0`, I tried to avoid making big changes.

The race condition: controller correctly watches network state and
issues etcd certs as needed, but the service `etcd` writes down PKI
files from the resource just once early on startup. In this case there's
a chance that wrong PKI gets written to disk leaving etcd with
incomplete certs.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-03-15 22:13:51 +03:00
Andrey Smirnov
f448cb4f3c
feat: bump boot partition size to 1000 MiB
With system extensions, size of the `initramfs` might increase
significantly. With 1000 MiB `/boot`, as we store `A` and `B` boot
directories, we have 500 MiB for each Talos boot (size of the kernel and
initramfs).

Fixes #5096

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-03-10 16:52:07 +03:00
Noel Georgi
a095acb09f
chore: fix equinixMetal platform name
fix equinix platform name

Signed-off-by: Noel Georgi <git@frezbo.dev>
2022-03-10 00:15:51 +05:30
Seán C McCord
2a7f9a4457
fix: check for IPv6 before applying accept_ra
When IPv6 is disabled entirely, we should not try to set `accept_ra`,
since it does not exist.
This performs a check before adding the default kernel parameter.

Fixes #5087

Signed-off-by: Seán C McCord <ulexus@gmail.com>
2022-03-07 10:13:08 -05:00
Andrey Smirnov
59681b8c9a
fix: backport fixes from release-1.0 branch
They were discovered as we tagged 1.0.0 version:

* wrong deprecated version
* incompatibility in extension compatibility checks

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-03-04 23:28:06 +03:00
Noel Georgi
dc8e9ed4a5
feat: bond interfaces from kernel cmdline
Support bond interfaces from kernel cmdline using `bond=` format

Fixes: #4765

Signed-off-by: Noel Georgi <git@frezbo.dev>
2022-03-03 23:54:53 +05:30
Artem Chernyshev
a50747a64a
fix: align list and diskusage command flags with their Linux analogs
Fixes: https://github.com/talos-systems/talos/issues/3018

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2022-03-02 22:27:56 +03:00
Andrey Smirnov
09efa62f68
chore: re-enable kexec and default to UEFI booting in tests
Fixes #4947

It turns out there's something related to boot process in BIOS mode
which leads to initramfs corruption on later `kexec`.

Booting via GRUB is always successful.

Problem with kexec was confirmed with:

* direct boot via QEMU
* QEMU boot via iPXE (bundled with QEMU)

The root cause is not known, but the only visible difference is the
placement of RAMDISK with UEFI and BIOS boots:

```
[    0.005508] RAMDISK: [mem 0x312dd000-0x34965fff]
```

or:

```
[    0.003821] RAMDISK: [mem 0x711aa000-0x747a7fff]
```

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-03-02 21:52:18 +03:00
Serge Logvinov
61461de634
feat: define resource reservation
Set memory/cpu resource reservation for system processes.
This helps system processes allocate memory when the node is under
memory pressure.

Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-03-02 17:18:03 +03:00
Andrey Smirnov
7ddc7f6053
feat: support specifying env vars for control plane pods
Fixes #5055

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-03-01 22:51:57 +03:00
Serge Logvinov
de69ab7902
fix: scaleway network config
We forgot to apply the network settings.

Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-02-28 21:12:01 +03:00
Andrey Smirnov
f81fb9f7cf
feat: implement sysfs
Fixes: https://github.com/talos-systems/talos/issues/4703

Co-authored-by: Dmitriy Matrenichev <lepage+gh@protonmail.com>
Co-authored-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2022-02-28 17:51:02 +03:00
Serge Logvinov
79d9720a35
fix: set route to metaserver for scaleway platform
Set default route to metaserver, which exists only on eth0 interface.

Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-02-25 22:48:51 +03:00
Andrey Smirnov
eb40b9254f
feat: add a way to override kubelet configuration via machine config
Fixes #4629

Note: some fields are enforced by Talos and are not overridable.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-02-25 17:39:01 +03:00
Noel Georgi
dc23715478
chore: update packet to equinix
Update `packet` to `equinix` for `talos.platform` kernel argument

Fixes: #5010

Signed-off-by: Noel Georgi <git@frezbo.dev>
2022-02-25 00:50:02 +05:30
Andrey Smirnov
7917b1aca0
feat: support admission control configuration and Pod Security admission
Fixes #5003

This implements a way to configure API server admission plugins via
Talos machine configuration.

If Pod Security admission is enabled, default cluster-wide policy is
generated which enforces baseline policy.

Policy can be overridden per-namespace.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-02-24 16:18:15 +03:00
Andrey Smirnov
b2bf3117ff
feat: implement extension services
Fixes #4694

User services run alongside Talos system services.
Every user service container root filesystem should be already present
in the Talos root filesystem.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-02-22 23:11:20 +03:00
Serge Logvinov
d749643e7e
feat: download metadata on Scaleway using low source port
This feature allows us to use a low source port (<1024) to make HTTP calls.

Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-02-22 16:50:05 +03:00
Matt Layher
743a030025
chore: bump github.com/mdlayher/arp@latest
Newest version of github.com/mdlayher/arp backed by the improved
https://github.com/mdlayher/packet package. There's no stable release
of arp yet but I'd like to get back around to that now that I'm stabilizing underlying pieces.

Signed-off-by: Matt Layher <mdlayher@gmail.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-02-22 15:07:17 +03:00
Seán C McCord
4d419a007f
feat: store audit logs to disk
Instead of bundling the apiserver audit logs with the rest of the
apiserver logs, we should store them separately to file, assuring
reasonable defaults for retention and rotation.

Fixes #5000

Signed-off-by: Seán C McCord <ulexus@gmail.com>
2022-02-21 09:12:27 -05:00
Seán C McCord
a5fb271ac8
feat: enable protectKernelDefaults in kubelet_spec
Enable the kubelet's builtin kernel configuration checks.
Also limits streaming connection timeout.

Fixes #5002
Fixes #4990

Signed-off-by: Seán C McCord <ulexus@gmail.com>
2022-02-18 11:03:06 -05:00
Utku Ozdemir
4d5cd66538
feat: add new grub parser and descriptive grub menu entries
Rewrite the GRUB config parser code and allow descriptive GRUB menu entries.
Remove old syslinux bootloader.

Fixes talos-systems/talos#4914

Signed-off-by: Utku Ozdemir <uoz@protonmail.com>
2022-02-18 14:47:17 +03:00
Andrey Smirnov
95a564ba2a
fix: prefer logical on merging link specs
This solves a case when lower layer (platform) defines `bond0` as
logical interface properly, and upper layer (configuration) defines only
some part of the config (e.g. VIP).

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-02-17 00:31:04 +03:00
Serge Logvinov
8b7091a06e
fix: correct vultr interface IP calculation
netaddr.Netmask changes the source IP to the masked subnet address:
    10.1.2.3/24 -> 10.1.2.0/24

Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-02-17 00:01:32 +03:00
Serge Logvinov
5a0fd63c81
fix: determine openstack interface IP correctly
netaddr.Netmask changes the source IP to the masked subnet address:
    10.1.2.3/24 -> 10.1.2.0/24

Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-02-16 23:28:58 +03:00
Andrey Smirnov
c6bca1b33b
docs: add guide on system extensions
This is the very first guide; we can expand it as we get more details.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-02-10 22:04:36 +03:00
Andrey Smirnov
492b156dab
feat: implement static pods via machine configuration
Fixes #4727

On worker nodes, static pods are injected, but status can't be monitored
by Talos. On control plane nodes full status is available via
`StaticPodStatus`.

Pod definition is left as `Unstructured` in the machine configuration,
and no specific validation is performed to avoid pulling in Kubernetes
libraries into Talos machinery package.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-02-10 18:37:19 +03:00
Andrey Smirnov
6fadfa8dbc
fix: parse properly IPv6 address in the cmdline ip= arg
Fixes #4953

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-02-10 16:57:39 +03:00
Serge Logvinov
cbc9610be6
feat: sysctl system optimization
This PR applies the most common sysctl tweaks.

* inotify is used to reload config files when they change
* tcp_keepalive_* helps to refresh the state of TCP connections

Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-02-10 15:50:58 +03:00
Serge Logvinov
8b6d6220d3
fix: parse interface ip correctly (nocloud)
netaddr.Netmask changes the source IP to the masked subnet address:

  10.1.2.3/24 -> 10.1.2.0/24

Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-02-10 15:31:15 +03:00
Andrey Smirnov
0da370dfef
test: unlock CABPT/CACPPT provider versions
We should always test latest versions of our providers.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-02-10 00:14:15 +03:00
Andrey Smirnov
df0e388a4f
feat: extract firmware part of system extensions into initramfs
Fixes #4816

This changes the way system extensions are packaged into the squashfs
images: `/lib/firmware` is now moved out of the future squashfs images
and becomes part of `initramfs` to make firmware available in the early
boot.

Talos will bind-mount `/lib/firmware` into rootfs as well, so it will be
available in the rootfs as well.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-02-09 22:58:45 +03:00
Andrey Smirnov
6bd07406e1
feat: disable reboots via kexec
See #4947

The goal is to disable kexec temporarily to move on with the system
extensions, and to find the root cause and fix kexec before the next
release.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-02-09 16:38:49 +03:00
Seán C McCord
d211bff47d
feat: enable accept_ra when IPv6 forwarding
Enables `accept_ra = 2` when IPv6 forwarding is enabled.

When IPv6 forwarding is enabled, the default `accept_ra = 1` no longer
functions.
This is intentional by the kernel developers, because routers generally should not
accept router advertisements (they supply their own).
However, in the case of a machine running Kubernetes, while IP
forwarding is enabled, the machine is still treated more as an end node
than a router.
It is common for a Kubernetes node to be configured via SLAAC and
therefore to expect to receive router advertisements, while at the same
time, IP forwarding must be enabled to handle container communication.

Fixes #3841

Signed-off-by: Seán C McCord <ulexus@gmail.com>
2022-02-07 20:05:18 -05:00
Seán C McCord
c347683670
fix: disable auto-tls for etcd
While we use properly-generated certs, it is (according to STIG 242379)
possible to allow a client to downgrade to self-signed acceptance without explicitly
disabling `auto-tls`.
This patch sets `auto-tls` to `false`, preventing the downgrade.

Signed-off-by: Seán C McCord <ulexus@gmail.com>
2022-02-05 15:37:09 -05:00
Andrey Smirnov
9bffc7e8d5
fix: pass proper sequence to shutdown sequence on ACPI shutdown
Shutdown sequence was refactored to support draining and force mode, but
other invocations of the shutdown sequence haven't been updated.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-02-03 21:47:23 +03:00
Andrey Smirnov
5484579c1a
feat: allow link scope routes in the machine config
They were supported internally, but never properly exposed in the
machine configuration.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-02-03 15:08:26 +03:00
Tim Jones
fe40e7b1b3
feat: drain node on shutdown
Cordon & drain a node when the Shutdown message is received.
Also adds a '--force' option to the shutdown command in case the control
plane is unresponsive.

Signed-off-by: Tim Jones <timniverse@gmail.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-02-01 00:06:32 +03:00