IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Ignore kernel command line for `SideroLink` and `EventsSink` config when
running in container mode. Otherwise when running Talos as a docker
container in Talos it picks up the host kernel cmdline and try to
configure SideroLink/EventsSink.
Signed-off-by: Noel Georgi <git@frezbo.dev>
Start container early in the boot process so system extension services
start in maintenance mode.
Fixes: #7083
Signed-off-by: Noel Georgi <git@frezbo.dev>
Fixes#7873
Some services which perform mounts inside the container which require
mounts to propagate back to the host (e.g. `stargz-snapshotter`) require
this configuration setting.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Support different providers, not only static file paths.
Drop `pcr-signing-key-public.pem` file, as we generate it on the fly
now.
See https://github.com/siderolabs/image-factory/issues/19
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
This PR does those things:
- It allows API calls `MetaWrite` and `MetaRead` in maintenance mode.
- SystemInformation resource now waits for available META
- SystemInformation resource now overwrites UUID from META if there is an override
- META now supports "UUID override" and "unique token" keys
- ProvisionRequest now includes unique token and Talos version
For #7694
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
Previously a fix was deployed in the Talos API client, but when the
request passes through `apid`, we need to make sure that proxy doesn't
reject large responses.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Move `setupLogging` inside the controller, so that logger is set up
correctly before Talos starts printing first messages.
This fixes an inconsistency that first messages are printed using
"default" logger, while after that the proper logger is set up, and
format of the messages matches kernel log.
Also move `waitForUSBDelay` into the sequencer after `udevd` was
started (this is when blockdevices including USB ones are discovered).
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
First of all, this interface is way more performant than `pcap`
interface. It is Linux-specific, but we don't care in Talos Linux :)
Second, this drop dependency of `machined` on `gopacket/layers` package,
which has huge issues with memory allocations and startup time.
This cuts around 20MiB of process RSS for all Talos processes.
(`talosctl` still requires this `gopacket/layers` library for decoding
packets).
Fixes#7880
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
As Talos doesn't consume `.machine.install` if already installed, there
is no point in validating it once already installed.
This fixes a problem users often run into: after a reboot/upgrade the
system disk blockdevice name changes, due to the kernel upgrade, or just
unpredictable behavior of device discovery. Talos fails to boot as it
can't validate the machine config, while it's already installed, so
actual blockdevice name doesn't matter.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
See https://github.com/siderolabs/image-factory/issues/44
Instead of using constants, use proper Talos version and kernel version
discovered from the image.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
This leads to lots of unnecessary improts, as the chain from network
controllers is pretty long.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Use fixed partition name instead of trying to auto-discover by label.
Auto-discovery by label might hit completely wrong blockdevice.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
See https://github.com/siderolabs/image-factory/issues/43
Two fixes:
* pass path to the dtb, uboot and rpi-firmware explicitly
* include dtb, uboot and rpi-firmware into arm64 installer image when
generated via imager (regular arm64 installer was fine)
(The generation of SBC images was not broken for Talos itself, but only
when used via Image Factory).
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Before we started a reboot/shutdown/reset/upgrade action with the action tracker (`--wait`), we were setting a flag to prevent cobra from printing the returned error from the command.
This was to prevent the error from being printed twice, as the reporter of the action tracker already prints any errors occurred during the action execution.
But if the error happens too early - i.e. before we even started the status printer goroutine, then that error wouldn't be printed at all, as we have suppressed the errors.
This PR moves the suppression flag to be set after the status printer is started - so we still do not double-print the errors, but neither do we suppress any early-stage error from being printed.
Closessiderolabs/talos#7900.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
This allows to "recover" secrets if the machine config was generated
first without explicitly saving secrets bundle.
Fixes#7895
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
This commit integrates the GOMEMLIMIT environment variable into shipped K8S
manifests when resources.limits.memory is defined. It is set to 95% of the
memory limit to optimize the performance of the Go garbage collector,
mitigating the risk of OOMKills in containerized environments.
When configuring the controller-manager or scheduler custom resources in
machine config, they where accepted, but ignored.
This commit adds Resources to NewControlPlaneSchedulerController and
NewControlPlaneControllerManagerController so machine config resources
Fixes#7874
Signed-off-by: Nico Berlee <nico.berlee@on2it.net>
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Use MAC address over network interface name.
Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
* Set static gateway IPv6 if it possible.
Some cni do not work properly with ipv6, so we will fix it.
* Disable talos dashboard.
Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
There were weird hacks put into the tests, while each test already runs
in a temporary directory as 'working directory', so no hacks are needed.
Moreover, using fixed `/tmp/...` paths leads to test failures, as CI
runs docker & QEMU tests in parallel conflicting with each other.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
* support for local-hostname parameter
* support for hostnames passed via user-data (for Proxmox VE)
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
First of all, it breaks our backwards compatibility promises and breaks
documentation generation. Upstream `specs.Mount` might change at any
time.
The issue was that containerd 1.7.x brings in new `specs.Mount` which
contains extra fields which don't have `omitempty` for YAML, so
machinery always generates them which confuses old Talos versions.
Use a copy of the upstream struct with proper YAML tags, and also
provide a special trick to make sure if the upstream struct changes, we
have a chance to update our copy of the struct.
Also this fixes docs and JSON schema.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
This does not fix the underlying digest mismatch issue, but does handle the error and should provide
further insight into issues (if present).
Refs: #7828
Signed-off-by: Thomas Way <thomas@6f.io>
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
The conversion from TPM 2 hash algorithm to Go crypto algorithm will fail for
uncommon algorithms like SM3256. This can be avoided by checking the constants
directly, rather than converting them. It should also be fine to allow some non
SHA-256 PCRs.
Fixes: #7810
Signed-off-by: Thomas Way <thomas@6f.io>
Signed-off-by: Noel Georgi <git@frezbo.dev>
drop `UpdateEndpointSuite` suite since KubePrism is enabled by default
starting Talos 1.6 and the test never passes since K8s node is always
ready since it can connect to api server over KubePrism.
Signed-off-by: Noel Georgi <git@frezbo.dev>
When STATE is reset, we need to make sure we wipe the META keys
containing encryption config as well.
Fixes#7819
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
This was a fix some time ago, but it was incorrect (missing `continue`),
which was failing the unit-tests.
Also fix a data race in another unit-test (which is unit-test only, not
affecting production).
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
When running on the machine, the extensionTreePath is not writeable, so
create and clean up a temporary directory to host `modules.dep`
extension.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Drop loop device/mounts completely, use userspace utilities to extract
and lay over module trees in the tmpfs.
Discover kernel version automatically instead of hardcoding it to be
current one (required for Image Service).
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Otherwise route gets created with priority '0' and it seems to get into
conflict with what Cilium tries to add.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Fixes#7738
If the SideroLink address changes, maintenance service should listen on
new address. Previously it worked "sometimes", as there was a race on
maintenance config either be removed/recreated or just updated. In case
of an update the listen address was not updated properly, but recreate
case worked correctly.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
This feature allows us to remove any comments from the machineconfig after
upgrading Kubernetes.
Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Add output flag for `talosctl config info`.
This allows to programatically gather endpoints for CI tests.
Eg:
```bash
_out/talosctl-linux-amd64 config info --output json | jq '.Contexts[].Endpoints[0]'
```
Signed-off-by: Noel Georgi <git@frezbo.dev>
Fixes#7679
This should be no-op if the link name is <= 10 chars, but with
predictable interface names based on MAC addresses, they have to be
shortened to make some space for VLAN ID.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Fixes#7698
Also fix `talosctl config info` for `talosconfig` without a client
certificate (e.g. Omni-generated one).
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
The default timeouts are very aggressive, and we should use explicit
timeouts so that healh checks don't run that often.
Fixes#7690
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
This is not a problem in general, but when running multiple image
generation procedures using the same mount point is a problem.
This is a no-op if `MountPrefix` is not set (when installing/upgrading
vs. creating an image).
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
This fixes a problem in the `RouteSpecController` which is due to a
subtle (but correct) change in the behavior in the `stdlib`.
Also some small (but should be safe) bumps.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Example: host has address `10.0.0.1/8`, while Kubernetes pod CIDR is
`10.244.0.0/16`. These two subnets overlap, but the address `10.0.0.1`
isn't contained in the `10.244.0.0/16` subnet.
This change fixes the check to make sure address is not contained vs.
the address subnet overlaps with the filter.
NB: this is still a bad idea to have host network subnet to overlap with
Kubernetes pod/service CIDRs.
Also refactor the unit-tests to use new (better ways) to do assertions.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Move drone extensions integration to a function. This allows us to
re-use the code and just depend on a single step rather than explicitly
defining all dependencies.
Signed-off-by: Noel Georgi <git@frezbo.dev>
This is required for https://github.com/siderolabs/sidero/pull/1070, as
we need to allow DHCP traffic from Sidero controller running in a VM
through the bridge to other VMs.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Processes and their info are not guaranteed to be present on the api-based data gathered by the dashboard. Therefore, we switch to using nil-safe access to the CPU time when rendering the process table.
Closessiderolabs/talos#7645.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Fixes#7615
This extends the previous handling when Talos did `ToLower()` on the
hostname to do the full filtering as expected.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
This is a follow-up fix for #7640
I noticed that image cleanup controller cleans up the images if
specified with both tag and digest.
The problem was incorrectly building image references in the expected
set of images, so they were incorrectly marked as unused.
Refactor the code to make the core part testable.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Fixes#7636
This support a `List`-type manifests by unwrapping them into individual
objects.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
First of all, it seems to be "right way", as it makes sure the image is
looked up by the digest.
Second, it fixes the case when image is specified with both tag and
digest (which is not supposed to be the correct ref, but it is used
frequently).
Talos since 1.5.0 stores images with the following aliases:
```
gcr.io/etcd-development/etcd:v3.5.9
gcr.io/etcd-development/etcd@sha256:8c956d9b0d39745fa574bb4dbacd362ffdc1109479432f54094859d4cf984b17
ghcr.io/siderolabs/kubelet:v1.28.0
ghcr.io/siderolabs/kubelet@sha256:50710f2cd3328c23f57dfc7fb00940d8cfd402315e33fc7cb8184fc660650a5c
sha256:50710f2cd3328c23f57dfc7fb00940d8cfd402315e33fc7cb8184fc660650a5c
sha256:8c956d9b0d39745fa574bb4dbacd362ffdc1109479432f54094859d4cf984b17
```
This change pulls the digest format (the last in this list) and uses it
to start a container.
Fixes#7640
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Short version is: move from global variables/`init()` function into
explicit functions.
`docgen` was updated to skip creating any top-level global variables,
now `Doc` information is generated on the fly when it is accessed.
Talos itself doesn't marshal the configuration often, so in general it
should never be accessed for Talos (but will be accessed e.g. for
`talosctl`).
Machine config examples were changed manually from variables to
functions returning a value and moved to a separate file.
There are no changes to the output of `talosctl gen config`.
There is a small change to the generated documentation, which I believe
is a correct one, as previously due to value reuse it was clobbered with
other data.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Fixes#7592
The problem was a mismatch between a "primary key" (ID) of the
`RouteSpec` and the way routes are looked up in the kernel - with two
idential routes but different priority Talos would end up in an infinite
loop fighting to remove and re-add back same route, as priority never
matches.
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
Use `Progress`, and options to pass around the way messages are written.
Fixed some tiny issues in the code, but otherwise no functional changes.
To make colored output work with `docker run`, switched back image
generation to use volume mount for output (old mode is still
functioning, but it's not the default, and it works when docker is not
running on the same host).
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
There are two changes here:
* build `machined` binary with `tcell_minimal` tag (which disables
loading some parts of the terminfo database), which also affects
`apid`, `trustd` and `dashboard` processes, as they run from the same
executable; in `dashboard` explicitly import `linux` terminal we're
using when the `dashboard` runs on the machine
* pass `TCELL_MINIMIZE=1` environment variable to each Talos process
which removes 0.5MiB of runewdith allocation for a lookup table
See #7578
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Currently, we use `github.com/coreos/go-semver/semver` and `github.com/hashicorp/go-version`
for version parsing. As we use `github.com/blang/semver/v4` in our other projects, and it
has more features, it makes sense to use it across the projects. It also doesn't allocate
like crazy in `KubernetesVersion.SupportedWith`.
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
Fixes#7080
The real bug was off-by-one in `log2i` implementation, other changes are
cleanups as `x/sys/unix` package now contains all the constants we need.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This PR updates the ProviderID format for aws resources. There seems to
be a bug when using Talos CCM (which consumes this value from Talos)
because the format is `aws://x/y` (two slashes) vs. the expected
`aws:///x/y` (three slashes) that is set with the AWS CCM code
[here](d055109367/pkg/providers/v1/instances.go (L47-L53)).
Setting only two slashes causes important software in the workload
cluster to fail, specifically cluster-autoscaler. The regex they use for
pulling providerID is [here](702e9685d6/cluster-autoscaler/cloudprovider/aws/aws_cloud_provider.go (L195)).
Signed-off-by: Spencer Smith <spencer.smith@talos-systems.com>
Fixes#7558
I see no reason to keep old behavior (removing all partitions on the
disk), as it's only compatible with Talos itself.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Support full configuration for image generation, including image
outputs, support most features (where applicable) for all image output
types, unify image generation process.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
* support for bonding
* added interface selection by MAC address
* added routes management
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
After we closed `kubelet`, remove `/var/lib/kubelet/cpu_manager_state` if there are any changes in `cpuManagerPolicy`.
We do not add any other safeguards, so it's user responsibility to cordon/drain the node in advance.
Also minor fixes in other files.
Closes#7504
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
Add security state resource that describes the state of Talos SecureBoot
and PCR signing key fingerprints.
The UKI fingerprint is currently not populated.
Fixes: #7514
Signed-off-by: Noel Georgi <git@frezbo.dev>
There are no functional changes, but SDK got updated to handle int ->
int64 changes. v1 version is only supported to Sep 2023.
See https://github.com/hetznercloud/hcloud-go#support
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This new fork seems to be more active. The change itself doesn't fix any
memory allocation, but I submitted a PR for gopacket/gopacket:
https://github.com/gopacket/gopacket/pull/24
Also fix crazy alloc in `tui/components` (this is only relevant for
`talosctl`).
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This module was imported just for a single Go struct (for XML
unmarshalling), and it could be easily internalized.
The module causes significant allocation on startup:
```
init github.com/vmware/govmomi/vim25/types @23 ms, 1.4 ms clock, 1269864 bytes, 196 allocs
```
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Both are much modular and pull in much less dependendencies in to the
Talos tree.
This solves the problem with allocations in AWS endpoints on import, and
removes a bunch of dependencies.
Raw binary size: -10 MiB.
Memory usage (not scientific): around -5 MiB for all Talos services.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
As part of bootloader refactoring `go-blockdevice` was used for wiping
partitions in #7329, but used standard wipe which could be fast/slow
depending on the blockdevice support. Switch to using fast-wipe for
partitions. This should not affect `wipe` option in machineconfig.
Fixes: #7531
Signed-off-by: Noel Georgi <git@frezbo.dev>
- Make dashboard SIGTERM-aware
- Handle panics on dashboard and terminate it gracefully, so it resets the terminal properly
- Switch to TTY2 when it starts and back to TTY1 when it stops.
Closessiderolabs/talos#7516.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
This refactors code to handle partial machine config - only multi-doc
without v1alpha1 config.
This uses improvements from
https://github.com/cosi-project/runtime/pull/300:
* where possible, use `TransformController`
* use integrated tracker to reduce boilerplate
Sometimes fix/rewrite tests where applicable.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes#7453
The goal is to make it possible to load some multi-doc configuration
from the platform source (or persisted in STATE) before machine acquires
full configuration.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes#7487
When `.kubelet.nodeIP` filters yield no match, Talos should not start
the kubelet, as using empty address list results in `--node-ip=` empty
kubelet arg, which makes kubelet pick up "the first" address.
Instead, skip updating (creating) the nodeIP and log an explicit
warning.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This adds basic support for shutdown/poweroff flags.
it can distringuish between halt/shutdown/reboot.
In the case of Talos halt/shutdown is same op.
Signed-off-by: Noel Georgi <git@frezbo.dev>
Use the `machined` socket for `shutdown` and `poweroff` aliases. This
ensures that worker nodes does not have to wait on apid to start.
Signed-off-by: Noel Georgi <git@frezbo.dev>
We can safely do it on `io.Writer` level, since `log.Logger.Output` (called by `Print|Printf`) pretty much promises
that every call to `Write` ends with `\n`.
Closes#7439
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
For rng seed and pcr extend, let's ignore if the device is not TPM2.0
based. Seal/Unseal operations would still error out since it's
explicitly user enabled feature.
Signed-off-by: Noel Georgi <git@frezbo.dev>
The previous flow was using TPM PCR 11 values to bound the policy which
means TPM cannot unseal when UKI changes. Now it's fixed to use PCR 7
which is bound to the SecureBoot state (SecureBoot status and
Certificates). This provides a full chain of trust bound to SecureBoot
state and signed PCR signature.
Also the code has been refactored to use PolicyCalculator from the TPM
library.
Signed-off-by: Noel Georgi <git@frezbo.dev>
This is intemediate step to move parts of the `ukify` down to the main
Talos source tree, and call it from `talosctl` binary.
The next step will be to integrate it into the imager and move `.uki`
build out of the Dockerfile.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
It seems that CRI has a bit of eventual consistency, and it might fail
to remove a stopped pod failing that it's still running.
Rewrite the upgrade API call in the upgrade test to actually wait for
the upgrade to be successful, and fail immediately if it's not
successful. This should improve the test stability and it should make
it easier to find issues immediately.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes#6391
Implement a set of APIs and commands to manage images in the CRI, and
pre-pull images on Kubernetes upgrades.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
The Go modules were not tagged for alpha.4, so using alpha.3 tag.
Talos 1.5 will ship with Kubernetes 1.28.0.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes#7430
Introduce a set of resources which look similar to other API
implementations: CA, certs, cert SANs, etc.
Introduce a controller which manages the service based on resource
state.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes#7379
Add possibility to configure the controlplane static pod resources via
APIServer, ControllerManager and Scheduler configs.
Signed-off-by: LukasAuerbeck <17929465+LukasAuerbeck@users.noreply.github.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Talos now supports new type of encryption keys which rely on Sealing/Unsealing randomly generated bytes with a KMS server:
```
systemDiskEncryption:
ephemeral:
keys:
- kms:
endpoint: https://1.2.3.4:443
slot: 0
```
gRPC API definitions and a simple reference implementation of the KMS server can be found in this
[repository](https://github.com/siderolabs/kms-client/blob/main/cmd/kms-server/main.go).
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Fixes#7228
Add some changes to make Talos accept partial machine configuration
without main v1alpha1 config.
With this change, it's possible to connect a machine already running
with machine configuration (v1alpha1), the following patch will connect
to a local SideroLink endpoint:
```yaml
apiVersion: v1alpha1
kind: SideroLinkConfig
apiUrl: grpc://172.20.0.1:4000/?jointoken=foo
---
apiVersion: v1alpha1
kind: KmsgLogConfig
name: apiSink
url: tcp://[fdae:41e4:649b:9303::1]:4001/
---
apiVersion: v1alpha1
kind: EventSinkConfig
endpoint: "[fdae:41e4:649b:9303::1]:8080"
```
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Previously, if META values were supplied to the Talos ISO via
environment variable, they will be written down and available after the
install. With this fix, values are also readable and available before
the installation runs (in maintenance mode).
Most of the PR is refactoring `meta.Value(s)` to be a shared library
which is used by the installer/imager and (now) Talos.
Also fixes an issue with not returning properly `NotExist` error when
META is not yet available as a partition on disk.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Some tools like qemu-guest-agent when ran as a extension service calls
`/sbin/shutdown` instead of `/sbin/poweroff`. This adds handling for the
same.
Ref: https://github.com/siderolabs/extensions/pull/173
Signed-off-by: Noel Georgi <git@frezbo.dev>
Allow specifying the reboot mode during upgrades by introducing `--reboot-mode` flag, similar to the `--mode` flag of the reboot command.
Closessiderolabs/talos#7302.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
This makes handling of `exec` more flexible.
Signed-off-by: Markus Reiter <me@reitermark.us>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This includes sd-boot handling, EFI variables, etc.
There are some TODOs which need to be addressed to make things smooth.
Install to disk, upgrades work.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This commit adds support for API load balancer. Quick way to enable it is during cluster creation using new `api-server-balancer-port` flag (0 by default - disabled). When enabled all API request will be routed across
cluster control plane endpoints.
Closes#7191
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
Move labels out of the bootloader interface, while moving copying assets
into the bootloader interface. GRUB is using one set of assets,
`sd-boot` will be using another one.
Fix the problem with `bootloader.Probe()` finding boot partition on the
host when it runs in a priv container, fixing issues with image creation
in the CI.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
`WITH_CONFIG_PATCH_WORKER` check result was overriding any value set in `CONFIG_PATCH_FLAG` variable.
Move it to the different variable.
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This changes the bootloader code to be generic to support
multiple bootloader implementations.
Signed-off-by: Noel Georgi <git@frezbo.dev>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes#7233
Waiting for node readiness now happens in the `MachineStatus` controller
which won't mark the node as ready until Kubernetes `Node` is ready.
Handling cordoning/uncordining happens with help of additional resource
in `NodeApplyController`.
New controller provides reactive `NodeStatus` resource to see current
status of Kubernetes `Node`.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
See #7233
The controlplane label is simply injected into existing controller-based
node label flow.
For controlplane taint default NoScheduleTaint, additional controller &
resource was implemented to handle node taints.
This also fixes a problem with `allowSchedulingOnControlPlanes` not
being reactive to config changes - now it is.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Because `SetConfig` can be called concurrently with `Config` there is risk of data race, if something goes wrong. Since `config.Provider` is an interface type, it means its size is two machine words. And so in very unpleasant situations it can lead to arbitrary RCE, because interface variable can be in partially updated state.
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
I ended up completely rewriting the controller, simplifying the flow
(somewhat) so that there's just a single control flow in the controller,
while reading from v1alpha1 events is converted to reading from a
channel.
Fixes#7227
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes#7333
Also fixed the discovery service controller to reconnect the client on
config changes (previously it wasn't reactive on e.g. URL changes).
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Use the `go-blockdevice` library to zero partitions.
Also added a test that writes `ones` to the partition and verifies its
zeroes after zeroing it.
Signed-off-by: Noel Georgi <git@frezbo.dev>
This changes the mounting/unmounting of `BOOT` partiton code into
`kexecPrepare` phase. Also skips if `BOOT` partition cannot be found.
Signed-off-by: Noel Georgi <git@frezbo.dev>
Fixes#7226
This follows same flow as other similar changes - split out logging
configuration as a separate resource, source it for now in the cmdline.
Rewrite the controller to allow multiple log outputs, add send retries.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This PR adds support for creating a list of API endpoints (each is pair of host and port).
It gets them from
- Machine config cluster endpoint.
- Localhost with LocalAPIServerPort if machine is control panel.
- netip.Addr[0] and port from affiliates if they are control panels.
For #7191
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
Use `udevd` rules to create stable interface names.
Link controllers should wait for `udevd` to settle down, otherwise link
rename will fail (interface should not be UP).
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
If the dashboard is run without the "Config URL" screen, do not initialize it, and do not probe the kernel args for the code parameter.
Refactor the dashboard to do not construct the unused screens at all.
Closessiderolabs/talos#7300.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
RENEW packets are sent unicast, so Talos needs the address of the DHCP
server to send RENEW packets to.
Fixes#7211Fixes#7263
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
`config.Container` implements a multi-doc container which implements
both `Container` interface (encoding, validation, etc.), and `Conifg`
interface (accessing parts of the config).
Refactor `generate` and `bundle` packages to support multi-doc, and
provide backwards compatibility.
Implement a first (mostly example) machine config document for
SideroLink API URL.
Many places don't properly support multi-doc yet (e.g. config patches).
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This is a port of ukify.py and systemd-measure from systemd.
This requires no actual TPM to be present to calculate the PCR
signatures.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Signed-off-by: Noel Georgi <git@frezbo.dev>
See #7230
Refactor more config interfaces, move config accessor interfaces
to different package to break the dependency loop.
Make `.RawV1Alpha1()` method typed to avoid type assertions everywhere.
No functional changes.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
See #7230
This is a step towards preparing for multi-doc config.
Split the `config.Provider` interface into parts which have different
implementation:
* `config.Config` accesses the config itself, it might be implemented by
`v1alpha1.Config` for example
* `config.Container` will be a set of config documents, which implement
validation, encoding, etc.
`Version()` method dropped, as it makes little sense and it was almost
not used.
`Raw()` method renamed to `RawV1Alpha1()` to support legacy direct
access to `v1alpha1.Config`, next PR will refactor more to make it
return proper type.
There will be many more changes coming up.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes#7246
The problem was that `udevd` watches via `inotify` any attempts to open
blockdevices with 'write' access.
Talos was opening with write access, but actually accessing as
read-only, so the fix is to open as read-only.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This is controlled with a feature flag which gets enabled automatically
for Talos 1.5+.
Fixes#7181
If enabled, configures kubelet to use project quotas to track xfs volume
usage, which is much more efficient than doing `du` periodically.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Kubelet doesn't refresh self-issued serving certificates, so force it by
removing the cert on each restart.
Fix the code which was forcing rejoin when the nodename changes, it was
broken, as it was checking serving certificate instead of client
certificate. It worked by accident when not using controlplane-issued
serving certificates.
Fixes#7235
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
* drop old resources API, which was deprecated long time ago
* use bootstrapped event in `talosctl get --watch` to better align
columns in the table output
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
- github.com/containerd/typeurl to v2.1.1
- github.com/aws/aws-sdk-go to v1.44.264
- alpine to 3.18.0
- node to 20.2.0-alpine
- github.com/containernetworking/plugins to v1.3.0
- github.com/docker/docker to v23.0.6+incompatible
- github.com/hetznercloud/hcloud-go to v1.45.1
- github.com/insomniacslk/dhcp to v0.0.0-20230516061539-49801966e6cb
- github.com/rivo/tview to v0.0.0-20230511053024-822bd067b165
- tools to v1.5.0-alpha.0-7-gd2dde48
- pkgs to v1.5.0-alpha.0-16-g7958db1
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
This reverts commit a2565f6741.
The fix done in `a2565f67`, was actually a no-op caused by the
misunderstanding the fix done in Go and backported to [Go 1.20.4](ecf7e00db8).
The fix gave a false confidence that it was working when it was tested
against Talos `main` branch since the PR #7190 bumped `x/sys` package
from [v0.7.0 -> v0.8.0](ecf7e00db8), the actual change in `x/sys` can be found here at ff18efa0a3 which meant that when updating Go to 1.20.4 the `x/sys` package should been updated too. The `x/sys` package changed how the syscall to set the rlimit was called, it got moved into the Go stdlib instead of calling rlimit syscall in the `x/sys` package, which meant a combination of using Go 1.20.4 and an older `x/sys` package means `RLIMIT_NOFILE` value would not be set back to the original value.
The Talos 1.4 release branch currently have `x/sys`
at [v0.7.0(https://github.com/siderolabs/talos/blob/v1.4.3/go.mod#L133),
so the backport would consist of this change along another commit bumping `x/sys` package to `v0.8.0`.
Fixes: #7198Fixes: #7206
Co-authored-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Signed-off-by: Noel Georgi <git@frezbo.dev>
This bug is pretty cosmetic, but it shows up as a wrong check when
performing worker upgrade - Talos pretends it checks e.g. kube-apiserver
version which doesn't make sense for workers.
There were two bugs in the code:
* check for machine type was done against `TypeWorker`, while
`MachineType` resource is initially created as `TypeUnknown`
* the cleanup code was not implemented
As I touched the code, I updated controller and tests to use modern
conventions.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Introduce a new resource, `SiderolinkConfig`, to store SideroLink connection configuration (api endpoint for now).
Introduce a controller for this resource which populates it from the Kernel cmdline.
Rework the SideroLink `ManagerController` to take this new resource as input and reconfigure the link on changes.
Additionally, if the siderolink connection is lost, reconnect to it and reconfigure the links/addresses.
Closessiderolabs/talos#7142, siderolabs/talos#7143.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Fixes#7159
The change looks big, but it's actually pretty simple inside: the static
pods had an annotation which tracks a version of the secrets which
forced control plane pods to reload on a change. At the same time
`kube-apiserver` can reload certificate inputs automatically from files
without restart.
So the inputs were split: the dynamic (for kube-apiserver) inputs don't
need to be reloaded, so its version is not tracked in static pod
annotation, so they don't cause a reload. The previous non-dynamic
resource still causes a reload, but it doesn't get updated when e.g.
node addresses change.
There might be many more refactoring done, the resource chain is a bit
of a mess there, but I wanted to keep number of changes minimal to keep
this backportable.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Talos doesn't have `rpc.statsd` running, so mounting without locking is
the only option. Some places in Kubernetes don't allow to set mount
options for NFS, so setting defaults is the only way.
Fixes#6582
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Ensure to wait as long as possibly given to kubelet shutdown timers.
Related to fix of siderolabs#7138
Signed-off-by: Niklas Wik <niklas.wik@nokia.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes#7137
The `umount` syscall might hang "forever" if the underlying network
filesystem endpoint is down.
To be on the safe side, add a timeout around unmount operations, and try
to umount with force as a last resort.
Sample log:
```
14795.458779] [talos] task unmountPodMounts (2/2): unmounting /var/lib/kubelet/plugins/kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com/dbe8d7f58e21d06cbef1ae0849317661eba4e82776722e7db5c65194ad73e916/globalmount/0001-0009-rook-ceph-0000000000000001-1051beb3-8d7a-4291-bf45-5711c13523d1
[14795.459797] [talos] task unmountPodMounts (2/2): unmounting /var/lib/kubelet/pods/f3f4d789-7f48-4dd9-9ef5-649b002c8f9c/volumes/kubernetes.io~csi/pvc-a4e72749-a8a1-43d9-9152-5bc1f757c924/mount
[14795.460555] EXT4-fs (rbd0): unmounting filesystem.
[14813.461319] [talos] task unmountPodMounts (2/2): unmounting /var/lib/kubelet/pods/f3f4d789-7f48-4dd9-9ef5-649b002c8f9c/volumes/kubernetes.io~csi/pvc-a4e72749-a8a1-43d9-9152-5bc1f757c924/mount is taking longer than expected, still waiting for 1m11.999162834s
[14831.460813] [talos] task unmountPodMounts (2/2): unmounting /var/lib/kubelet/pods/f3f4d789-7f48-4dd9-9ef5-649b002c8f9c/volumes/kubernetes.io~csi/pvc-a4e72749-a8a1-43d9-9152-5bc1f757c924/mount is taking longer than expected, still waiting for 53.999567033s
[14849.461336] [talos] task unmountPodMounts (2/2): unmounting /var/lib/kubelet/pods/f3f4d789-7f48-4dd9-9ef5-649b002c8f9c/volumes/kubernetes.io~csi/pvc-a4e72749-a8a1-43d9-9152-5bc1f757c924/mount is taking longer than expected, still waiting for 35.998979117s
[14867.460748] [talos] task unmountPodMounts (2/2): unmounting /var/lib/kubelet/pods/f3f4d789-7f48-4dd9-9ef5-649b002c8f9c/volumes/kubernetes.io~csi/pvc-a4e72749-a8a1-43d9-9152-5bc1f757c924/mount is taking longer than expected, still waiting for 17.999502128s
[14885.461123] [talos] task unmountPodMounts (2/2): unmounting /var/lib/kubelet/pods/f3f4d789-7f48-4dd9-9ef5-649b002c8f9c/volumes/kubernetes.io~csi/pvc-a4e72749-a8a1-43d9-9152-5bc1f757c924/mount with force
[14885.462395] [talos] ignoring unmount error /var/lib/kubelet/pods/f3f4d789-7f48-4dd9-9ef5-649b002c8f9c/volumes/kubernetes.io~csi/pvc-a4e72749-a8a1-43d9-9152-5bc1f757c924/mount: invalid argument
[14885.463529] [talos] task unmountPodMounts (2/2): unmounting /var/run/netns/cni-0888dc71-ba9e-af8a-d322-074f654561e5
[14885.464267] [talos] task unmountPodMounts (2/2): done, 1m30.028862262s
```
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
API server takes care of setting priority for "regular" pods from
priorityClassName, but nothing does that for static pods, so we have to
specify the priotity explicitly for static pods.
This fixes the graceful node shutdown (kubelet) to stop non-critical
pods before the api-server and friends (critical pods).
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fix udevd not triggering rules properly.
This also fixes an issue with go-blockdevice not resolving symlinks.
Fixes: #7117
Signed-off-by: Noel Georgi <git@frezbo.dev>
Fixes#7121
Talos pulls some images on its own (without CRI/kubelet) to the `system`
namespace of the CRI containerd. These images are not visible to the
CRI/kubelet, so we need to clean them up manually.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Rename members to machines to be clearer.
Display the correct member count.
Closessiderolabs/talos#7127.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Add startup probes that probe the containers for 60 seconds before switching to liveness probes.
Closessiderolabs/talos#7054.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Hide kube-apiserver, kube-controller-manager and kube-scheduler statuses on the dashboard for the worker nodes, instead of showing them as n/a.
Also display the cluster name as n/a for workers (instead of an empty string), as that information is not available to them.
Closessiderolabs/talos#7103.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Show the config URL template that will be populated when the code is entered. Closessiderolabs/talos#7092.
Clear the form when the tab is exited & do not display "Saved successfully" message when the code is saved, as we navigate to the summary tab afterward anyway. Closessiderolabs/talos#7093.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Updated documentation, what's new, etc.
Also fix some minor UI issues in the dashboard.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Prevent dashboard from crashing when a dead/non-existent node is specified on `talosctl --nodes`.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Use GRUB quoting function to the kernel args passed to Talos.
This fixes passing `${variable}` to `talos.config=` kernel argument.
Also fix a problem with `ONBUILD` being exected for `imager` image.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
The problem was that 'Succeeded' pod was treated as 'not ready', so that
`MachineStatus` never reached readiness state.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes#7041
Rework the DHCP flow so that we don't use `INFORM` requests anymore. The
idea is to try requesting a hostname from the DHCP server first, and if
the hostname is not send, or it gets overridden in Talos, restart the
DHCP sequence sending the hostname to the DHCP server.
This still avoids sending and requesting a hostname in one request.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes: https://github.com/siderolabs/talos/issues/7017
Should allow external services to detect which user block devices might
need to be wiped during reset.
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Network probes are configured with the specs, and provide their output
as a status.
At the moment only platform code can configure network probes.
If any network probes are configured, they affect network.Status
'Connectivity' flag.
Example, create the probe:
```
talosctl -n 172.20.0.3 meta write 0xa '{"probes": [{"interval": "1s", "tcp": {"endpoint": "google.com:80", "timeout": "10s"}}]}'
```
Watch probe status:
```
$ talosctl -n 172.20.0.3 get probe
NODE NAMESPACE TYPE ID VERSION SUCCESS
172.20.0.3 network ProbeStatus tcp:google.com:80 5 true
```
With failing probes:
```
$ talosctl -n 172.20.0.3 get probe
NODE NAMESPACE TYPE ID VERSION SUCCESS
172.20.0.3 network ProbeStatus tcp:google.com:80 4 true
172.20.0.3 network ProbeStatus tcp:google.com:81 1 false
$ talosctl -n 172.20.0.3 get networkstatus
NODE NAMESPACE TYPE ID VERSION ADDRESS CONNECTIVITY HOSTNAME ETC
172.20.0.3 network NetworkStatus status 5 true true true true
```
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Implement a screen for entering/managing the config `${code}` variable.
Enable this screen only when the platform is `metal` and there is a `${code}` variable in the `talos.config` kernel cmdline URL query.
Additionally, remove the "Delete" button and its functionality from the network config screen to avoid users accidentally deleting PlatformNetworkConfig parts that are not managed by the dashboard.
Add some tests for the form data parsing on the network config screen.
Remove the unnecessary lock on the summary tab - all updates come from the same goroutine.
Closessiderolabs/talos#6993.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Restart udevd on adding custom rules where in the case the subsystems
needs to be re-triggered.
Fixes: #7001
Signed-off-by: Noel Georgi <git@frezbo.dev>
talosctl netstat -k show all host and non-hostnetwork pods sockets/connections.
talosctl netstat namespace/pod shows sockets/connections of a specific pod +
autocompletes in the shell.
Signed-off-by: Nico Berlee <nico.berlee@on2it.net>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
* Clear the input form and switch to summary tab after the network config is saved.
* Use nodeaddress resource for detecting and displaying IPs. Improve the IP filtering logic.
* Fix the logic of gateway detection. Display all gateways instead of a single one.
* Use hostnamestatus resource to detect the hostname instead of an API call.
* Add hostname entry to the network info section on summary tab (as `HOST`).
* Enable `OUTBOUND` entry in network info section on summary tab.
* Display only the physical network interfaces in the interface dropdown on network config tab.
* Improve form input handling.
* Additional minor fixes & improvements.
Closessiderolabs/talos#6992.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
This adds support for automatically registering node hostnames in DNS by
sending the current hostname to DHCP via option 12. If the current hostname is
updated, issue a new DISCOVER to propagate the update to DHCP (updating the
hostname on lease renewals is not universally supported by DHCP servers). This
addition maintains the previous functionality where the node can also request
its hostname from the DHCP server. The received hostname will be processed and
prioritized as usual by the `network.HostnameSpecController`.
This change set also contains fixes to make DHCP renewals compliant with RFC
2131, specifically avoiding sending the server identifier and requested IP
address when issuing renewals using a previous offer. This also uncovered
issues and missing features in the upstream `insomniacslk/dhcp` library, the
fixes and improvements for which are now finally merged.
Sending hostname updates have been tested against `dnsmasq` and the built-in
DHCP + DNS services in Windows Server. Hostname retrieval from DHCP and edge
cases with overridden hostnames from different configuration layers have been
extensively tested against `dnsmasq`.
Signed-off-by: Dennis Marttinen <twelho@welho.tech>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Unify getting environment variables, support passing environment
variables via kernel args.
Fixes#6984
See #6999
For META this will be used to pass environment variables to the
installer for ISO images (or PXE booting).
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
New variable value is coming from `META`, and it might be set using the
interactive console (not implemented yet, but it will come soon).
I had to refactor the URL expansion implementation:
* simplify things where possible
* provide more unit-tests for smaller units
* handle expansion of all variables in parallel
* allow parallel expansion on multiple variables
Also I refactored download code to support proper passing of endpoint
function with context.
The end result:
* Talos will try to download config for 3 hours before rebooting
* Each attempt which includes URL expansion + download is limited to 3
minutes
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Implement the network config screen with input forms to configure the initial node networking by writing a config to the META partition.
Closessiderolabs/talos#6961.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
The problem was that `GracefulStop()` will hang forever if there is a
running API call. So if there is a running streaming call, the
maintenance service might hang until it is finished.
The problem shows up with 'Upgrade' API in the maintenance mode if there
is a concurrent streaming API call, e.g.:
1. Watch API is running against maintenance mode.
2. Upgrade API is issued, it tries to run the MaintenanceUpgrade
sequence, which tries to take over the Initialize sequence. The
Initialize sequence is canceled, maintenance API service context is
canceled, but the service doesn't terminate, as it's stuck in
`GracefulStop`. The sequence take over times out, as even the
sequence is canceled, it hasn't terminated yet.
Sample log:
```
[talos] upgrade request received: "ghcr.io/siderolabs/installer:v1.3.3"
[talos] upgrade failed: failed to acquire lock: timeout
[talos] task loadConfig (1/1): failed: failed to receive config via maintenance service: maintenance service failed: context canceled
[talos] phase config (6/7): failed
[talos] initialize sequence: failed
<stuck here>
```
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Bump golangci-lint and fixup new warnings. Ignore check that checks for
used function parameters, it's kind of noisy and makes it confusing to
read interface implementations.
Signed-off-by: Noel Georgi <git@frezbo.dev>
Discovered in #6971. Go compiler cannot deduce proper type on 32bit architectures for those constants,
in `fmt.Print(f)` functions. Since we only compare them with uint32 variables, it makes sense to add proper
types to them.
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
The problem showed up on 'reset' of the Talos node which had multiple
endpoints for other control plane nodes, many of which weren't actually
available.
When 'grpc.WithBlock()' is used, etcd will try to dial the first
endpoint and return an error if the dial fails.
Use noblock mode by default with multiple endpoints, and blocking mode
with a single endpoint.
Pass the context to etcd to properly abort dial operations if the
context get canceled.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Instead of doing excessive get/list requests, do a watch per node in an infinite retry.
Additionally, refactor the dashboard code to make the various data listener namings more consistent and reorganize the packages.
Closessiderolabs/talos#6960.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
If link has no `Info` field we can't do anything meaningful, so we'll just log and skip.
Also fix race in test.
For #6956
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
Fix a data race caused by the metadata field of PlatformNetworkConfig being edited after it was sent to the channel. It caused test failures.
Fix it by setting a copy of the metadata instead.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
A special META key might contain optional platform network config for
the `METAL` platform.
It is completely optional, but if present, it works same way as in the
clouds: it is applied with low priority (can be overridden with machine
config), but provides some initial defaults for the machine.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This allows to put keys to META partition.
META contents can be viewed with `talosctl get metakeys`.
There is not real usecase for it yet, but the next PRs will introduce
two special keys which can be written:
* platform network config for `metal`
* `${code}` variable
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Implement the new summary dashboard with node info and logs.
Replace the previous metrics dashboard with the new dashboard which has multiple screens for node summary, metrics and editing network config.
Port the old metrics dashboard to the tview library and assign it to be a screen in the new dashboard, accessible by F2 key.
Add a new resource, infos.cluster.talos.dev which contains the cluster name and id of a node.
Disable the network config editor screen in the new dashboard until it is fully implemented with its backend.
Closessiderolabs/talos#4790.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Use a global instance, handle loading/saving META in global context.
Deprecate legacy syslinux ADV, provide an easier interface for
consumers.
Expose META as resources.
Fix the bootloader revert process (it was completely broken for quite a
while :sad:).
This is a first step which mostly does preparation work, real changes
will come in the next PRs:
* add APIs to write to META
* consume META keys for platform network config for `metal`
* custom key for URL `${code}`
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes#6730
`go generate`-based step downloads the upstream manifest, transforms it
to match our requirements, and it is compiled in as the Flannel
manifest.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Launch CoreDNS even if the node is not initialized.
Network is ready already, but CCM didn't finish their job.
Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes#6817
The original problem wasn't reproducible with `main`, but there was a
set of bugs in the shutdown sequence which was preventing it from
completing successfully, as in the maintenance mode nothing is running
and initialized yet.
Most of the bugs were `nil` pointer dereferences.
Fixed a small issue with final 'RebootError' printed as a failure in the
ACPI shutdown path.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This PR adds first 12 symbols from container ID and adds them to `talosctl -k containers` each container output.
That way we can ensure that we get the logs from proper container even if there is a newer one.
Closes#6886
Co-authored-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
Use new version of go-kubernetes, and move the `kube-proxy` DaemonSet
update to follow common logic of bootstrap manifests update.
This fixes a confusing behavior when after `k8s-upgrade` the version of
`kube-proxy` is not updated in the machine config.
See https://github.com/siderolabs/go-kubernetes/pull/3
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Exposes the pod IP as the `POD_IP` environment variable via the downward
API in the kube-proxy pod for use in e.g. metrics-bind-addr.
Signed-off-by: Tim Jones <tim.jones@siderolabs.com>
This introduces a new role for Talos API which fills the gap between
`os:reader` and `os:admin` roles.
Fixes#6898
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
When removing a member from `etcd`, the server does a pre-check to make
sure the member is connected to a quorum of other members, and the
remove request might fail. Add a retry to wait for the etcd to be fully
connected before giving up, as some parts of the reset flow alrady ran.
Also fix an issue which appears in the integration test, when `reset` is
called early in the boot sequence when local etcd hasn't started fully yet.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
This PR ensures that we can handle third grub option - "wipe". We will use it in 1.4.
For #6842
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
Fixes: https://github.com/siderolabs/talos/issues/6815
Additionally, make it possible to run reset in maintenance mode: to
enable a way for resetting system disk and remove all traces of Talos
from it.
The new reset flow works in a separate sequence, changed disk probe
lookup to check the boot partition instead of the ephemeral one.
Signed-off-by: Artem Chernyshev <artem.chernyshev@talos-systems.com>
Move dashboard package into a common location where both Talos and talosctl can use it.
Add support for overriding stdin, stdout, stderr and ctt in process runner.
Create a dashboard service which runs the dashboard on /dev/tty2.
Redirect kernel messages to tty1 and switch to tty2 after starting the dashboard on it.
Related to siderolabs/talos#6841, siderolabs/talos#4791.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
- github.com/aws/aws-sdk-go to v1.44.209
- github.com/stretchr/testify to v1.8.2
- github.com/jsimonetti/rtnetlink to v1.3.1
- google.golang.org/genproto to v0.0.0-20230223222841-637eb2293923
- github.com/emicklei/dot to v1.3.1
- github.com/gdamore/tcell/v2 to v2.6.0
- github.com/insomniacslk/dhcp to v0.0.0-20230220063916-5369909a5de7
- github.com/jsimonetti/rtnetlink to v1.3.1
- github.com/opencontainers/runtime-spec to v1.1.0-rc.1.0.20230215090456-58ec43f9fc39
- github.com/rivo/tview to v0.0.0-20230226195229-47e7db7885b4
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
The `modules.dep` kernel module dependency tree extension root path was
previously created with a permission of `0o700` which means the talos
root go a permission of `0o700` when the kernel module tree was re-built
when extensions providing kernel modules was enabled. This means that
any binaries lost the executable permission when ran as non-root
creating an `EACCES` error. Fix by making sure the temporary directory
created for building kernel modules tree has `0o755` permission
explicitly.
Signed-off-by: Noel Georgi <git@frezbo.dev>
The shared code is going out to the
github.com/siderolabs/go-kubernetes library.
The code will be used in Talos and other projects using same features.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Load & start machined earlier and in initialize sequence, so that it is possible to use its API over its unix socket in maintenance mode.
Additionally, do not return features from Version API if a config is not yet available.
Related to siderolabs/talos#4791.
Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
Can set netmask as number.
Signed-off-by: Serge Logvinov <serge.logvinov@sinextra.dev>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Talos always supported that, but CRI config lacked support for it.
Now with recent containerd the new `_default` host is used as a
fallback, so this re-enables the support and updates the docs.
See https://github.com/containerd/containerd/pull/8065
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Fixes: #6802
Automatically load kernel modules based on hardware info and modules
alias info. udevd would automatically load modules based on HW
information present.
Signed-off-by: Noel Georgi <git@frezbo.dev>
This fixes the issue when the overlay mount target directory was used as
lowerdir for the mount, creating extra folders in the extension.
Fix the issue by adding support for normal overlay mounts to use a
source directory when specified.
Also fixes a small issue where messages was logged when error is nil.
Signed-off-by: Noel Georgi <git@frezbo.dev>
Not sure how I missed it in the first PR, but that's the only character
which was not quoted properly.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Wait for the network before trying to access the metadata service.
Retry the calls when appropriate (most platforms use `download.Download`
function which does proper retries).
Co-authored-by: Noel Georgi <git@frezbo.dev>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
One case was missing: when network section is present, but value is
omitted.
Fixes#6825
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Use process wrapper introduced in #6814 to drop capabilities. This change
also means the capabilities are dropped per process level and not for
PID 1 (machined), which allows us to drop capabilities per process.
Signed-off-by: Noel Georgi <git@frezbo.dev>
Use a wrapper for starting processes which can setup proper cgroups,
OOMscore, and also drop capabilities for the process, then it calls
`execve`.
The containerd tests is also fixed to support cgroups when
running tests in buildkit. It used to pass previously as we did not
error if cgroup setup failed.
Signed-off-by: Noel Georgi <git@frezbo.dev>
This fixes multiple issues:
* `log.Fatalf` in the machined code leads to kernel panic
* return URL if some expansion fails
* correctly handle destroyed event (wait for the next one)
Fixes#6807
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
One of the fields in the GRUB config - boot arguments - contains
user-controlled input. Talos supports variable expansion in
`talos.config` parameter, and uses `${var}` syntax.
In GRUB config, `}` is a special character, and introduction of `}`
breaks config parsing both for GRUB and Talos.
Correctly escape and unescape special characters.
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
The previous `udevd` healthcheck was incomplete and if `udevd` took more
time to startup the initial `udevadm trigger` would have silently failed
failing to setup proper devices. `udevadm trigger` returns an exit code
of zero even if `udevd` is not running. This PR fixes by first checking
if the `udevd` control socket exists, which is a faster check, then
making sure `udevd` is up by running `udevadm control` command. This
ensures that `udevd` is properly initialized before running any `udevadm
trigger` commands even if `udevd` is restarted/killed.
Signed-off-by: Noel Georgi <git@frezbo.dev>
This excludes it out of the `NodeAddress`.
Needs extra testing to confirm that it actually still works as anchor
IP.
Fixes#6760
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>