1685 Commits

Author SHA1 Message Date
Dmitriy Matrenichev
638dc9128f
fix: fix "defer" leak in ResetUserDisks
Also, print error if we failed to close the device.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-02-28 21:51:37 +03:00
Dmitriy Matrenichev
bfba3677b0
chore: handle grub option - "wipe"
This PR ensures that we can handle third grub option - "wipe". We will use it in 1.4.

For #6842

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-02-28 21:21:28 +03:00
Artem Chernyshev
b520710810
feat: introduce new flag in reset API that makes Talos reset user disks
Fixes: https://github.com/siderolabs/talos/issues/6815

Additionally, make it possible to run reset in maintenance mode: to
enable a way for resetting system disk and remove all traces of Talos
from it.

The new reset flow works in a separate sequence, changed disk probe
lookup to check the boot partition instead of the ephemeral one.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2023-02-28 15:10:41 +03:00
Utku Ozdemir
f55f5df739
feat: move dashboard package & run it in tty2
Move dashboard package into a common location where both Talos and talosctl can use it.

Add support for overriding stdin, stdout, stderr and ctt in process runner.

Create a dashboard service which runs the dashboard on /dev/tty2.

Redirect kernel messages to tty1 and switch to tty2 after starting the dashboard on it.

Related to siderolabs/talos#6841, siderolabs/talos#4791.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2023-02-28 12:00:25 +01:00
Dmitriy Matrenichev
36e077ead4
chore: bump deps
- github.com/aws/aws-sdk-go to v1.44.209
- github.com/stretchr/testify to v1.8.2
- github.com/jsimonetti/rtnetlink to v1.3.1
- google.golang.org/genproto to v0.0.0-20230223222841-637eb2293923
- github.com/emicklei/dot to v1.3.1
- github.com/gdamore/tcell/v2 to v2.6.0
- github.com/insomniacslk/dhcp to v0.0.0-20230220063916-5369909a5de7
- github.com/jsimonetti/rtnetlink to v1.3.1
- github.com/opencontainers/runtime-spec to v1.1.0-rc.1.0.20230215090456-58ec43f9fc39
- github.com/rivo/tview to v0.0.0-20230226195229-47e7db7885b4

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-02-28 00:14:59 +03:00
Noel Georgi
426fe9687d
fix: extension base folder permission
The `modules.dep` kernel module dependency tree extension root path was
previously created with a permission of `0o700` which means the talos
root go a permission of `0o700` when the kernel module tree was re-built
when extensions providing kernel modules was enabled. This means that
any binaries lost the executable permission when ran as non-root
creating an `EACCES` error. Fix by making sure the temporary directory
created for building kernel modules tree has `0o755` permission
explicitly.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-02-27 19:49:06 +05:30
Andrey Smirnov
230e46e567
refactor: extract parts of kubernetes libraries
The shared code is going out to the
github.com/siderolabs/go-kubernetes library.

The code will be used in Talos and other projects using same features.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-02-22 14:56:49 +04:00
Utku Ozdemir
5ac9f43e45
feat: start machined earlier & in maintenance mode
Load & start machined earlier and in initialize sequence, so that it is possible to use its API over its unix socket in maintenance mode.

Additionally, do not return features from Version API  if a config is not yet available.

Related to siderolabs/talos#4791.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2023-02-21 12:21:36 +01:00
Dmitriy Matrenichev
3d55bd80f4
fix: add --force flag to talosctl gen config
Only overwrite existing files if explicitly demanded.

Closes #6847

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-02-20 23:44:00 +03:00
Serge Logvinov
660b8874da
feat: cmdline integer netmask
Can set netmask as number.

Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-02-20 20:55:56 +04:00
Andrey Smirnov
6e8f13529c
fix: add support for a fallback '*' mirror configuration
Talos always supported that, but CRI config lacked support for it.

Now with recent containerd the new `_default` host is used as a
fallback, so this re-enables the support and updates the docs.

See https://github.com/containerd/containerd/pull/8065

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-02-16 23:12:13 +04:00
Artem Chernyshev
dcd4eb1a93
fix: improve error message on single node upgrade
Fixes: https://github.com/siderolabs/talos/issues/6828

Propose a solution if the node upgrade fails.

Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
2023-02-16 17:33:04 +03:00
Noel Georgi
2d01480180
feat: automatically load modules based on hw info
Fixes: #6802

Automatically load kernel modules based on hardware info and modules
alias info. udevd would automatically load modules based on HW
information present.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-02-14 19:57:13 +05:30
Noel Georgi
7b75cd8b94
fix: kernel module dependency tree generation
This fixes the issue when the overlay mount target directory was used as
lowerdir for the mount, creating extra folders in the extension.

Fix the issue by adding support for normal overlay mounts to use a
source directory when specified.

Also fixes a small issue where messages was logged when error is nil.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-02-14 01:07:11 +05:30
Noel Georgi
65d02e5ade
fix: dbus shutdown when it's not initialized
If dbus is not started and a shutdown was called talos panics, fix by
checking if the mock is nil.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-02-13 21:12:54 +05:30
Andrey Smirnov
a7079ce85c
fix: quote the ampersand character in GRUB config
Not sure how I missed it in the first PR, but that's the only character
which was not quoted properly.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-02-13 18:58:34 +04:00
Andrey Smirnov
dcbcf5a93c
fix: wait for network and retry in platform get config funcs
Wait for the network before trying to access the metadata service.

Retry the calls when appropriate (most platforms use `download.Download`
function which does proper retries).

Co-authored-by: Noel Georgi <git@frezbo.dev>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-02-09 21:04:43 +04:00
Andrey Smirnov
e09e106665
fix: default dns domain to 'cluster.local' in local case
One case was missing: when network section is present, but value is
omitted.

Fixes #6825

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-02-08 14:35:28 +04:00
Noel Georgi
cc6e37a47f
feat: use process wrapper for dropping capabilities
Use process wrapper introduced in #6814 to drop capabilities. This change
also means the capabilities are dropped per process level and not for
PID 1 (machined), which allows us to drop capabilities per process.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-02-07 00:49:56 +05:30
Noel Georgi
5cb2915d8e
feat: use wrapper for starting processes
Use a wrapper for starting processes which can setup proper cgroups,
OOMscore, and also drop capabilities for the process, then it calls
`execve`.

The containerd tests is also fixed to support cgroups when
running tests in buildkit. It used to pass previously as we did not
error if cgroup setup failed.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-02-03 18:32:09 +05:30
Andrey Smirnov
38a51191e4
fix: correctly expand parameters in the URL
This fixes multiple issues:

* `log.Fatalf` in the machined code leads to kernel panic
* return URL if some expansion fails
* correctly handle destroyed event (wait for the next one)

Fixes #6807

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-02-02 18:42:45 +04:00
Andrey Smirnov
54f7d4c923
fix: correctly quote and unquote strings in GRUB config
One of the fields in the GRUB config - boot arguments - contains
user-controlled input. Talos supports variable expansion in
`talos.config` parameter, and uses `${var}` syntax.

In GRUB config, `}` is a special character, and introduction of `}`
breaks config parsing both for GRUB and Talos.

Correctly escape and unescape special characters.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-02-02 17:11:22 +04:00
Noel Georgi
590a393de9
fix: udevd healthcheck
The previous `udevd` healthcheck was incomplete and if `udevd` took more
time to startup the initial `udevadm trigger` would have silently failed
failing to setup proper devices. `udevadm trigger` returns an exit code
of zero even if `udevd` is not running. This PR fixes by first checking
if the `udevd` control socket exists, which is a faster check, then
making sure `udevd` is up by running `udevadm control` command. This
ensures that `udevd` is properly initialized before running any `udevadm
trigger` commands even if `udevd` is restarted/killed.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-01-30 16:12:41 +05:30
Noel Georgi
812a2877cd
chore: bump deps + renovate cleanup
Bump dependencies.
Disable renovate for PR's and skip un-needed update checks.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-01-24 00:42:58 +05:30
Andrey Smirnov
aa9f66c1c8
fix: mark DigitalOcean anchor IP as scope link
This excludes it out of the `NodeAddress`.

Needs extra testing to confirm that it actually still works as anchor
IP.

Fixes #6760

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-01-23 20:35:52 +04:00
Andrey Smirnov
3e00571627
fix: unwrap gRPC errors on stop/remove pods check
As the client returns wrapped errors, unwrap them using our own method
which does `errors.As` instead of gRPC one which doesn't do unwrapping.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-01-23 14:29:04 +04:00
Andrey Smirnov
00e52ae078
fix: build correctly etcd initial cluster URL
The supposed format with multiple adverised URLs is:

`name=u1,name=u2`

Previously Talos generated:

`name=u1,u2`

(which is wrong)

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-01-20 22:52:47 +04:00
Dmitriy Matrenichev
c5954f4345
chore: bump deps
For some reason `go-mod-outdated` didn't work for me, so I had to do
this manually.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-01-19 21:40:00 +03:00
Noel Georgi
d4b8b35de7
feat: generate kernel module dependency tree
Run `depmod` during install/upgrades when extensions provide kernel
modules and `modules.dep` needs to be re-generated. This also allows
modules of same name from kernel to co-exist. Modules in `extras`
folder takes precedence over `in-built` ones.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-01-19 18:54:10 +05:30
Andrey Smirnov
18122ae73e
fix: service restart (including extension services)
Fixes #6707

There was a race condition between different parts of the service code:
`Stop` waits for the event which is published before the service is
removed from the `running[id]` map, so if one does `Stop` followed by
`Start` (this is what `services restart` API does), by the time it goes
to `Start` it might be still in the `running[id]` map, so `Start` does
nothing.

Overall this code should be rewritten and simplified, but for now move
out sending these "terminal" events out so that by the time the event is
published, the service is stopped and removed from the `running[id]`
map.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-01-18 14:52:47 +04:00
Andrey Smirnov
0b65bbfc87
fix: handle overwriting tags in syslinux ADV
This is (still) being used in Talos to handle upgrade rollbacks.

There were multiple problems with this code, and one of them leads to
panic if the tag is written multiple times without deletion:

```
github.com/siderolabs/talos/internal/app/machined/pkg/runtime/v1alpha1/bootloader/adv/syslinux.ADV.SetTagBytes({0xc00175bc00?, 0x1f11dbe?, 0xed4f4d?}, 0x0?, {0xc000afb7f0?, 0x400?, 0x0?})
/src/internal/app/machined/pkg/runtime/v1alpha1/bootloader/adv/syslinux/syslinux.go:125 +0x270
github.com/siderolabs/talos/internal/app/machined/pkg/runtime/v1alpha1/bootloader/adv/syslinux.ADV.SetTag(...)
/src/internal/app/machined/pkg/runtime/v1alpha1/bootloader/adv/syslinux/syslinux.go:95
github.com/siderolabs/talos/cmd/installer/pkg/install.(*Installer).Install(0xc0004374a0, 0x5)
/src/cmd/installer/pkg/install/install.go
```

The `uint8()` conversion was causing overflow and wrong index when ADV
real length is over 255.

Fix multiple writes of the same tag by deleting previous value first.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-01-17 23:21:39 +04:00
Serge Logvinov
70d9428a1d
fix: kubespan MSS clamping
Change TCP maximum segment size if it goes through the KubeSpan to match
KubeSpan MTU.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-01-17 19:02:33 +04:00
Andrey Smirnov
062c7d754b
test: fix integration test on cp endpoint update
As with #6724, controlplane node kubelet doesn't use control plane
endpoint anymore, run the test on the worker node instead of cp node.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-01-12 15:23:14 +04:00
Andrey Smirnov
0a5a8802e7
feat: use 'localhost' endpoint for controlplane nodes
This switches the last usage of Kubernetes controlplane endpoint to use
`localhost` (itself) for controlplane nodes.

Worker nodes still use cluster-wide controlplane endpoint.

This allows controlplane nodes to boot fully even if the controlplane
endpoint (e.g. loadbalancer) doesn't function.

The process of joining etcd still requires either a discovery service or
a proper functioning controlplane endpoint.

With this fix, Talos controlplane nodes can boot successfully without a
loadbalancer being up, while worker nodes obviously won't join.

This improves Talos behavior in single-node clusters when controlplane
endpoint is not available, the node will still boot just fine and
function properly.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-01-10 20:50:51 +04:00
Andrey Smirnov
29020cb9c7
fix: report fatal sequence errors as reboots
When the sequence fails hard, Talos does automatic reboot, so reflect
this in the machine status properly.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-01-10 14:24:23 +04:00
Andrey Smirnov
96629d5ba6
feat: implement etcd maintenance commands
This allows to safely recover out of space quota issues, and perform
degragmentation as needed.

`talosctl etcd status` command provides lots of information about the
cluster health.

See docs for more details.

Fixes #4889

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-01-03 23:25:28 +04:00
Andrey Smirnov
80fed31940
feat: include Kubernetes controlplane endpoint as one of the endpoints
These endpoints are used for workers to find the addresses of the
controlplane nodes to connect to `trustd` to issue certificates of
`apid`.

These endpoints today come from two sources:

* discovery service data
* Kubernetes API server endpoints

This PR adds to the list static entry based on the Kubernetes control
plane endpoint in the machine config.

E.g. if the loadbalancer is used for the controlplane endpoint, and that
loadbalancer also proxies requests for port 50001 (trustd), this static
endpoint will provide workers with connectivity to trustd even if the
discovery service is disabled, and Kubernetes API is not up.

If this endpoint doesn't provide any trustd API, Talos will still try
other endpoints.

Talos does server certificate validation when calling trustd,
so including malicious endpoints doesn't cause any harm, as malicious
endpoint can't provider proper server certificate.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-01-03 21:33:18 +04:00
Serge Logvinov
80f150ac85
feat: enable ipv6 on gcp
Introduce ipv6 to the google cloud.
It also can work with dhcpv6 is on.
But the route receives through RA packages which not working.

Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-12-28 14:45:49 +04:00
Serge Logvinov
f6a86ae906
fix: oralce cloud zone
Zone definition misspell.
Native services use uppercase zone.

Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-12-26 14:49:26 +04:00
Andrey Smirnov
89dbb0ecf0
release(v1.4.0-alpha.0): prepare release
This is the official v1.4.0-alpha.0 release.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-12-23 22:32:09 +04:00
Andrey Smirnov
31fb905358
feat: update Linux 6.1.1, containerd 1.6.14
Bumps tools/pkgs/extras to the latest.

Bumps Go modules.

Enables adaptive capacity for COSI state.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-12-23 20:30:09 +04:00
Andrey Smirnov
a0c0352ddc
fix: send diagnostic output to stderr consistently
Fixes #6676

There was a mix of stdout/stderr, move more consistently to stderr.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-12-23 18:41:56 +04:00
Andrey Smirnov
9a5f4c08a2
fix: default the manifest namespace if not set
This seems to happen specifically for CRDs, regular Kubernetes resources
have some extra magic.

Fixes #6663

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-12-22 16:20:46 +04:00
Niklas Wik
34babe858d
chore: make organization selection an interface
Making organization a interface for preparing to avoid giving
system:masters access to the talosctl kubeconfig generated certificate.

Signed-off-by: Niklas Wik <niklas.wik@nokia.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-12-19 15:12:30 +04:00
Nico Berlee
171aa94679
fix: disable Wireless Lan using dtoverlay
Talos has no wireless support & wireless kernel drivers,
so disabling it the recommended way might actually might save power consumption.

It could save ~45 mA:
https://forums.raspberrypi.com/viewtopic.php?t=257144#p1568474

Or 'The WiFi half of the wireless chip will be powered but be held in reset':
https://forums.raspberrypi.com/viewtopic.php?t=343854#p2060246

Either way, it does not hurt and it should be treated the same as bluetooth.

Signed-off-by: Nico Berlee <nico.berlee@on2it.net>
Signed-off-by: Noel Georgi <git@frezbo.dev>
2022-12-17 01:48:43 +05:30
Dmitriy Matrenichev
eb332cfcb7
feat: add health check for a minimal memory / disk size
This PR adds two additional checks which are performed during boot sequence and in `talosctl health`. They ensure that nodes have enough memory and disk.

- Boot check will print a warning if memory / disk size is not sufficient.
- Health check will fail if memory / disk size is not sufficient.

Closes #6467

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2022-12-10 07:05:08 +03:00
Andrey Smirnov
d04970dfa9
fix: ignore k8s additional addresses if nil
This fixes a potential panic which I found in the unit-tests logs.

The error 'not found' is ignored, so need an addiitonal check.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-12-09 19:25:07 +04:00
Dmitriy Matrenichev
a8ebcca4a9
chore: remove watchErr from metal.getResource
It's only used to detect if resource is `nil` or of incorrect type. Both errors are developer errors, so we should not collect them.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2022-12-06 22:04:28 +03:00
Dmitriy Matrenichev
1253513bd1
fix: fix nil pointer panic and incorrect error output
Currently `.Error()` call is panicking if `watchErr` is nil. Besides - we want to wrap errors the way we can unwrap them.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2022-12-06 21:03:25 +03:00
Andrey Smirnov
82e8c9e1f6
fix: workaround panic in the kubelet service controller
The traceback:

```
user: warning: [2022-12-02T17:31:09.496341098Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletServiceController", "error": "controller \x5c"k8s.KubeletServiceController\x5c" panicked: runtime error: invalid memory address or nil pointer dereference\x5cn\x5cngoroutine 308 [running]:\x5cnruntime/debug.Stack()\x5cn\x5ct/toolchain/go/src/runtime/debug/stack.go:24 +0x65\x5cngithub.com/cosi-project/runtime/pkg/controller/runtime.(*adapter).runOnce.func2()\x5cn\x5ct/.cache/mod/github.com/cosi-project/runtime@v0.1.1/pkg/controller/runtime/adapter.go:403 +0x5d\x5cnpanic({0x2b7b600, 0x536c7c0})\x5cn\x5ct/toolchain/go/src/runtime/panic.go:884 +0x212\x5cngithub.com/talos-systems/talos/internal/app/machined/pkg/controllers/k8s.updateKubeconfig(0xc0000d49b0?)\x5cn\x5ct/src/internal/app/machined/pkg/controllers/k8s/kubelet_service.go:302 +0xb8\x5cngithub.com/talos-systems/talos/internal/app/machined/pkg/controllers/k8s.(*KubeletServiceController).Run(0xc000956030, {0x389f7c0, 0xc000808040}, {0x38bce60, 0xc0000dfa80}, 0x0?)\x5cn\x5ct/s...
```

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2022-12-06 20:53:30 +04:00