4141 Commits

Author SHA1 Message Date
Alex Lubbock
ecce29dee9
fix: upgrade-k8s use internal IP first, external IP fallback
Currently, upgrade-k8s adds both node internal and external IPs.
This commit uses the internal IP if available; external IP is
only used as a fallback.

Signed-off-by: Alex Lubbock <code@alexlubbock.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-31 18:21:27 +04:00
Andrey Smirnov
3c64a5ffba
chore: optimize image generation time
Use `pigz` and `--sparse` to handle more efficiently compression of the
assets.

Also move tasks out of `setup-ci` step, as it runs always, including for
the promoted pipelines.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-31 17:47:23 +04:00
Eirik Askheim
2292f36d97
chore: registry.k8s.io for coredns image
Replace docker.io with registry.k8s.io for the coredns image.

Signed-off-by: Eirik Askheim <eirik@x13.no>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-31 17:06:13 +04:00
Alex Corcoles
f2b258b373
docs: document talosctl version for upgrades
Use same version as running cluster.

Signed-off-by: Alex Corcoles <alex@corcoles.net>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-31 16:29:53 +04:00
Andrey Smirnov
a0773f783c
chore: add ukify Go script
This is a port of ukify.py and systemd-measure from systemd.

This requires no actual TPM to be present to calculate the PCR
signatures.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-05-30 23:33:26 +05:30
Andrey Smirnov
b69e38d1ff
chore: bump dependencies
New pkgs, Linux 6.1.30, Flannel 0.22.0, Go modules.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-29 23:19:44 +04:00
DJAlPee
adce651034
docs: add piraeus/drbd to storage documentation
How-To install Piraeus on a Talos cluster

Signed-off-by: DJAlPee <DJAlPee@GitHub.com>
Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-05-29 16:52:01 +05:30
Alex Corcoles
a982cabe70
docs: link support matrix in k8s update doc
Provide a link to explain what versions are supported.

Signed-off-by: Alex Corcoles <alex@pdp7.net>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-29 14:58:11 +04:00
Andrey Smirnov
1fb29a56a8
fix: fail quickly if upgrade-k8s is used with multiple nodes
Fixes #7283

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-29 14:35:29 +04:00
Noel Georgi
51d931c470
chore: faster dev cycle
This cleans up `Dockerfile` and `Makefile` targets to be in similar
parity with `kres` auto-generated targets.

Now `make talosctl` would only build the one for the specific local
machine making development easier. Also added a `iso` docker target
that builds iso for local development without having to push and pull
the imager. (`make local-iso DEST=_out`)

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-05-27 19:10:11 +05:30
Andrey Smirnov
dc6764871c
refactor: move around config interfaces, make RawV1Alpha1 typed
See #7230

Refactor more config interfaces, move config accessor interfaces
to different package to break the dependency loop.

Make `.RawV1Alpha1()` method typed to avoid type assertions everywhere.

No functional changes.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-23 22:08:58 +04:00
Andrey Smirnov
ea9a97dba3
fix: fall back to external IP when discovering nodes in upgrade-k8s
Fixes #7253

Also fix the case that `kube-proxy` version was updated in the machine
config in `--dry-run` mode.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-23 17:54:35 +04:00
Andrey Smirnov
0bb7e8a5cf
refactor: split config.Provider into Config & Container
See #7230

This is a step towards preparing for multi-doc config.

Split the `config.Provider` interface into parts which have different
implementation:

* `config.Config` accesses the config itself, it might be implemented by
  `v1alpha1.Config` for example
* `config.Container` will be a set of config documents, which implement
  validation, encoding, etc.

`Version()` method dropped, as it makes little sense and it was almost
not used.

`Raw()` method renamed to `RawV1Alpha1()` to support legacy direct
access to `v1alpha1.Config`, next PR will refactor more to make it
return proper type.

There will be many more changes coming up.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-23 16:05:16 +04:00
Dmitriy Matrenichev
85d8a16194
chore: bump deps
Bump deps

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-05-22 16:02:15 -04:00
Spencer Smith
39b7a56f01
chore: use 8GiB instead of 10GiB for cloud images
This PR changes the default disk size for cloud images to be 8GiB
instead. This was prompted b/c the disk price in azure between tiers is
doubled and the cutoff for the tier is 8GiB.

Signed-off-by: Spencer Smith <spencer.smith@talos-systems.com>
2023-05-19 20:35:13 -04:00
Andrey Smirnov
ff11fd39c7
fix: race with udevd and mountUserDisks
Fixes #7246

The problem was that `udevd` watches via `inotify` any attempts to open
blockdevices with 'write' access.

Talos was opening with write access, but actually accessing as
read-only, so the fix is to open as read-only.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-19 22:02:48 +04:00
Spencer Smith
c3fabb9829
chore: update default image sizes to 10GB for all "cloud" images
This PR adds a flag to imager that allows for tweaking the size of the created disk. Additionally, it sets the default value of that created disk to 10GB, as most images are cloud images that fail when uploaded b/c it only picks up a 1GB disk currently. Also adds some processing the makefile to make sure we set the default small value for metal images and SBCs.

Signed-off-by: Spencer Smith <spencer.smith@talos-systems.com>
2023-05-19 13:35:39 -04:00
Andrey Smirnov
10155c390e
feat: enable xfs project quota support, kubelet feature
This is controlled with a feature flag which gets enabled automatically
for Talos 1.5+.

Fixes #7181

If enabled, configures kubelet to use project quotas to track xfs volume
usage, which is much more efficient than doing `du` periodically.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-19 20:33:39 +04:00
Andrey Smirnov
eba8185642
release(v1.5.0-alpha.0): prepare release
This is the official v1.5.0-alpha.0 release.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-19 18:38:24 +04:00
Andrey Smirnov
383471c3e9
feat: update default Kubernetes to v1.27.2
See https://github.com/kubernetes/kubernetes/releases/v1.27.2

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-19 15:14:17 +04:00
Dmitriy Matrenichev
8f68d1abef
chore: bump deps
- github.com/benbjohnson/clock to v1.3.5
- cilium-cli to v0.14.3

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-05-18 19:04:05 -04:00
Christian Rolland
e0c1585d30
feat: create azure community gallery image version on release
Create Azure Community Gallery Image Version on release:
- Add /hack/cloud-image-uploader/azure.go
	- Upload vhd file to container for all architectures
	- Create managed disk from vhd file for all architectures
	- Create image version from managed disk for all architectures
- Modify /hack/cloud-image-uploader/main.go
	- Start Community Gallery processes concurently with AWS upload
- Modify /hack/cloud-image-uploader/options.go
	- Add additional Options for Community Gallery processes
- Modify .drone.jsonnet to use secrets for environment variables
	- The following secrets need to be created for this to work:
		- azure_subscription_id
		- azure_client_id
		- azure_client_secret
		- azure_tenant_id

Signed-off-by: Christian Rolland <christian.rolland@siderolabs.com>

chore: fix linting errors in readme

Fix linting errors in readme

Signed-off-by: Christian Rolland <christian.rolland@siderolabs.com>

chore: fix markdown linting errors

Fix markdown linting errors in readme

Signed-off-by: Christian Rolland <christian.rolland@siderolabs.com>

chore: fix markdown linting errors

Fix markdown linting errors in readme

Signed-off-by: Christian Rolland <christian.rolland@siderolabs.com>

chore: change disk size to match new 10GB cloud image size

Change disk size to match 10GB cloud image size

Signed-off-by: Christian Rolland <christian.rolland@siderolabs.com>
2023-05-18 17:49:55 -04:00
Andrey Smirnov
dd8336c9ee
fix: refresh kubelet self-issued serving certificates
Kubelet doesn't refresh self-issued serving certificates, so force it by
removing the cert on each restart.

Fix the code which was forcing rejoin when the nodename changes, it was
broken, as it was checking serving certificate instead of client
certificate. It worked by accident when not using controlplane-issued
serving certificates.

Fixes #7235

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-18 22:19:34 +04:00
Andrey Smirnov
bb02dd263c
chore: drop deprecated stuff for Talos 1.5
* drop old resources API, which was deprecated long time ago
* use bootstrapped event in `talosctl get --watch` to better align
  columns in the table output

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-18 19:46:37 +04:00
Dmitriy Matrenichev
61cad86731
chore: bump deps
- github.com/containerd/typeurl to v2.1.1
- github.com/aws/aws-sdk-go to v1.44.264
- alpine to 3.18.0
- node to 20.2.0-alpine
- github.com/containernetworking/plugins to v1.3.0
- github.com/docker/docker to v23.0.6+incompatible
- github.com/hetznercloud/hcloud-go to v1.45.1
- github.com/insomniacslk/dhcp to v0.0.0-20230516061539-49801966e6cb
- github.com/rivo/tview to v0.0.0-20230511053024-822bd067b165
- tools to v1.5.0-alpha.0-7-gd2dde48
- pkgs to v1.5.0-alpha.0-16-g7958db1

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-05-18 01:07:36 -04:00
Andrey Smirnov
01dfd3af7d
feat: update etcd to v3.5.9
See https://github.com/etcd-io/etcd/releases/tag/v3.5.9

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-15 15:59:23 +04:00
Ricky Sadowski
aa65fbb8a1
chore: update KUBECTL_URL to reflect the community bucket
This PR changes the url used in the Makefile from a legacy
URL to point to the new community owned download host.

Signed-off-by: Ricky Sadowski <richard.j.sadowski@gmail.com>
Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-05-15 16:54:00 +05:30
Noel Georgi
cc3128d944
chore: bump kernel to 6.1.28
Bump kernel to 6.1.28

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-05-15 14:30:02 +05:30
Dmitriy Matrenichev
97fffaf78a
chore: use ctest.UpdateWithConflicts instead of plain UpdateWithConflicts
More type-safety.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-05-12 20:39:32 -04:00
Noel Georgi
3b36993b99
fix: rlimit nofile test
The test was added at the wrong place.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-05-12 16:20:52 +05:30
Dmitriy Matrenichev
45e6e27af7
chore: bump runtime
Use new functions and methods from runtime module.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-05-11 17:18:08 -04:00
Noel Georgi
4f720d4653
fix: revert: set rlimit explicitly in wrapperd
This reverts commit a2565f67416e9b9bc22f2d5506df9ea7771c0c8c.

The fix done in `a2565f67`, was actually a no-op caused by the
misunderstanding the fix done in Go and backported to [Go 1.20.4](ecf7e00db8).
The fix gave a false confidence that it was working when it was tested
against Talos `main` branch since the PR #7190 bumped `x/sys` package
from [v0.7.0 -> v0.8.0](ecf7e00db8), the actual change in `x/sys` can be found here at ff18efa0a3 which meant that when updating Go to 1.20.4 the `x/sys` package should been updated too. The `x/sys` package changed how the syscall to set the rlimit was called, it got moved into the Go stdlib instead of calling rlimit syscall in the `x/sys` package, which meant a combination of using Go 1.20.4 and an older `x/sys` package means `RLIMIT_NOFILE` value would not be set back to the original value.

The Talos 1.4 release branch currently have  `x/sys`
at [v0.7.0(https://github.com/siderolabs/talos/blob/v1.4.3/go.mod#L133),
so the backport would consist of this change along another commit bumping `x/sys` package to `v0.8.0`.

Fixes: #7198
Fixes: #7206

Co-authored-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-05-11 23:38:20 +05:30
Noel Georgi
a2565f6741
fix: set rlimit explicitly in wrapperd
Now Go only sets the rlimit for the parent and any fork/exec'ed process
gets the rlimit that was the default before fork/exec. Ref: https://github.com/golang/go/issues/46279

This fix got backported to [Go 1.20.4](ecf7e00db8) breaking Talos.
Talos used to set rlimit in the [`SetRLimit`](https://github.com/siderolabs/talos/blob/v1.4.2/internal/app/machined/pkg/runtime/v1alpha1/v1alpha1_sequencer_tasks.go#L302) sequencer task.
This means any process started by `wrapperd` gets the default Rlimit
(1024). Fix this by explicitly setting `rlimit` in `wrapperd` before we
drop any capabilities.

Fixes: #7198

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-05-10 00:10:17 +05:30
Andrey Smirnov
cdfc242b83
chore: re-enable Go buildid
With Go 1.20.4 the reproducibility issue is fixed.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-08 21:40:42 +04:00
Andrey Smirnov
e67f3f5c54
feat: linux 6.1.27, containerd 1.6.21, go 1.20.4
Plus bunch of other dependencies.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-08 20:26:19 +04:00
Andrey Smirnov
55ae59a0ad
fix: properly skip/cleanup controlplane configs for workers
This bug is pretty cosmetic, but it shows up as a wrong check when
performing worker upgrade - Talos pretends it checks e.g. kube-apiserver
version which doesn't make sense for workers.

There were two bugs in the code:

* check for machine type was done against `TypeWorker`, while
  `MachineType` resource is initially created as `TypeUnknown`
* the cleanup code was not implemented

As I touched the code, I updated controller and tests to use modern
conventions.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-05 23:17:27 +04:00
Andrey Smirnov
64eade9bde
chore: clean up unused constant
There is another `KubeProxyImage` which we're actually using.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-05 20:18:10 +04:00
Utku Ozdemir
62c6e9655c
feat: introduce siderolink config resource & reconnect
Introduce a new resource, `SiderolinkConfig`, to store SideroLink connection configuration (api endpoint for now).

Introduce a controller for this resource which populates it from the Kernel cmdline.

Rework the SideroLink `ManagerController` to take this new resource as input and reconfigure the link on changes.

Additionally, if the siderolink connection is lost, reconnect to it and reconfigure the links/addresses.

Closes siderolabs/talos#7142, siderolabs/talos#7143.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2023-05-05 17:04:34 +02:00
Andrey Smirnov
860002c735
fix: don't reload control plane pods on cert SANs changes
Fixes #7159

The change looks big, but it's actually pretty simple inside: the static
pods had an annotation which tracks a version of the secrets which
forced control plane pods to reload on a change. At the same time
`kube-apiserver` can reload certificate inputs automatically from files
without restart.

So the inputs were split: the dynamic (for kube-apiserver) inputs don't
need to be reloaded, so its version is not tracked in static pod
annotation, so they don't cause a reload. The previous non-dynamic
resource still causes a reload, but it doesn't get updated when e.g.
node addresses change.

There might be many more refactoring done, the resource chain is a bit
of a mess there, but I wanted to keep number of changes minimal to keep
this backportable.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-05 16:59:09 +04:00
Andrey Smirnov
d43c61e80f
fix: enforce nolock option for all NFS mounts by default
Talos doesn't have `rpc.statsd` running, so mounting without locking is
the only option. Some places in Kubernetes don't allow to set mount
options for NFS, so setting defaults is the only way.

Fixes #6582

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-04 17:26:36 +04:00
Niklas Wik
339986db9d
fix: inhibit timer to follow kubelet timer
Ensure to wait as long as possibly given to kubelet shutdown timers.
Related to fix of siderolabs#7138

Signed-off-by: Niklas Wik <niklas.wik@nokia.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-04 15:08:56 +04:00
Andrey Smirnov
cbf6dc1009
fix: set timeout for unmount calls
Fixes #7137

The `umount` syscall might hang "forever" if the underlying network
filesystem endpoint is down.

To be on the safe side, add a timeout around unmount operations, and try
to umount with force as a last resort.

Sample log:

```
14795.458779] [talos] task unmountPodMounts (2/2): unmounting /var/lib/kubelet/plugins/kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com/dbe8d7f58e21d06cbef1ae0849317661eba4e82776722e7db5c65194ad73e916/globalmount/0001-0009-rook-ceph-0000000000000001-1051beb3-8d7a-4291-bf45-5711c13523d1
[14795.459797] [talos] task unmountPodMounts (2/2): unmounting /var/lib/kubelet/pods/f3f4d789-7f48-4dd9-9ef5-649b002c8f9c/volumes/kubernetes.io~csi/pvc-a4e72749-a8a1-43d9-9152-5bc1f757c924/mount
[14795.460555] EXT4-fs (rbd0): unmounting filesystem.
[14813.461319] [talos] task unmountPodMounts (2/2): unmounting /var/lib/kubelet/pods/f3f4d789-7f48-4dd9-9ef5-649b002c8f9c/volumes/kubernetes.io~csi/pvc-a4e72749-a8a1-43d9-9152-5bc1f757c924/mount is taking longer than expected, still waiting for 1m11.999162834s
[14831.460813] [talos] task unmountPodMounts (2/2): unmounting /var/lib/kubelet/pods/f3f4d789-7f48-4dd9-9ef5-649b002c8f9c/volumes/kubernetes.io~csi/pvc-a4e72749-a8a1-43d9-9152-5bc1f757c924/mount is taking longer than expected, still waiting for 53.999567033s
[14849.461336] [talos] task unmountPodMounts (2/2): unmounting /var/lib/kubelet/pods/f3f4d789-7f48-4dd9-9ef5-649b002c8f9c/volumes/kubernetes.io~csi/pvc-a4e72749-a8a1-43d9-9152-5bc1f757c924/mount is taking longer than expected, still waiting for 35.998979117s
[14867.460748] [talos] task unmountPodMounts (2/2): unmounting /var/lib/kubelet/pods/f3f4d789-7f48-4dd9-9ef5-649b002c8f9c/volumes/kubernetes.io~csi/pvc-a4e72749-a8a1-43d9-9152-5bc1f757c924/mount is taking longer than expected, still waiting for 17.999502128s
[14885.461123] [talos] task unmountPodMounts (2/2): unmounting /var/lib/kubelet/pods/f3f4d789-7f48-4dd9-9ef5-649b002c8f9c/volumes/kubernetes.io~csi/pvc-a4e72749-a8a1-43d9-9152-5bc1f757c924/mount with force
[14885.462395] [talos] ignoring unmount error /var/lib/kubelet/pods/f3f4d789-7f48-4dd9-9ef5-649b002c8f9c/volumes/kubernetes.io~csi/pvc-a4e72749-a8a1-43d9-9152-5bc1f757c924/mount: invalid argument
[14885.463529] [talos] task unmountPodMounts (2/2): unmounting /var/run/netns/cni-0888dc71-ba9e-af8a-d322-074f654561e5
[14885.464267] [talos] task unmountPodMounts (2/2): done, 1m30.028862262s
```

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-03 23:32:23 +04:00
Andrey Smirnov
b58f913d5f
fix: set the static pod priority as values
API server takes care of setting priority for "regular" pods from
priorityClassName, but nothing does that for static pods, so we have to
specify the priotity explicitly for static pods.

This fixes the graceful node shutdown (kubelet) to stop non-critical
pods before the api-server and friends (critical pods).

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-02 20:20:20 +04:00
Steve Francis
f8a7a5b6bf
docs: add information about KubeSpan ports and topology
Update KubeSpan documentation.

Signed-off-by: Steve Francis <steve.francis@talos-systems.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-01 17:57:43 +04:00
Steve Francis
2bad74d642
docs: add how to on scaling down
Describe scaling down Talos cluster.

Signed-off-by: Steve Francis <steve.francis@talos-systems.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-01 16:48:13 +04:00
Thomas Perronin
7442ff8b09
chore: fix typos inteface -> interface (docs and tests)
Fix typos.

Signed-off-by: Thomas Perronin <gecko.splinter@gmail.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-01 16:15:08 +04:00
Michael A. Davis
d4e94f7a15
fix: add back required TARGETARCH for installer
Adds back in the required TARGETARCH for installer so extensions can be built off installer again as nvidia nonfree extension building was broken.

Fixes: #7155
Refs: #7115

Signed-off-by: Michael A. Davis <6325127+mrmichaeladavis@users.noreply.github.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-01 15:41:56 +04:00
Andrey Smirnov
e6fffda013
chore: linux 6.1.26, runc 1.1.7
Update to the latest pkgs.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-04-27 20:31:04 +04:00
Andrey Smirnov
344746ae2f
fix: bump max inhibit delay to 20 min
Fixes #7138

This brings max shutdown period to 20 min that kubelet would accept.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-04-27 17:08:04 +04:00
Andrey Smirnov
d9bdea2b54
chore: fork docs and compatibility modules for Talos 1.5
Getting ready for the next Talos 1.5.0-alpha.0 release.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-04-27 15:36:31 +04:00