4085 Commits

Author SHA1 Message Date
Noel Georgi
e3f3f5794d
feat: implement revert for sd-boot
Implement revert for sd-boot.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-06-22 20:20:31 +05:30
Walt Chen
d8b0903d70
docs: vagrant setup document fix
Fix the incorrect yamlpath of controlplane.yaml to be applied to
controlplane node.

Signed-off-by: Walt Chen <godsarmycy@gmail.com>
2023-06-20 23:48:52 +05:30
Andrey Smirnov
fe0f46980f
feat: implement secure boot from disk
This includes sd-boot handling, EFI variables, etc.

There are some TODOs which need to be addressed to make things smooth.

Install to disk, upgrades work.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-06-16 20:15:16 +05:30
Dmitriy Matrenichev
445f5ad542
feat: support API server load balancer
This commit adds support for API load balancer. Quick way to enable it is during cluster creation using new `api-server-balancer-port` flag (0 by default - disabled). When enabled all API request will be routed across
cluster control plane endpoints.

Closes #7191

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-06-16 10:09:20 -04:00
Andrey Smirnov
19bc223de8
refactor: bootloader interface, labels
Move labels out of the bootloader interface, while moving copying assets
into the bootloader interface. GRUB is using one set of assets,
`sd-boot` will be using another one.

Fix the problem with `bootloader.Probe()` finding boot partition on the
host when it runs in a priv container, fixing issues with image creation
in the CI.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-06-14 17:33:11 +04:00
Dmitriy Matrenichev
665702ddd3
chore: fix cilium e2e tests
`WITH_CONFIG_PATCH_WORKER` check result was overriding any value set in `CONFIG_PATCH_FLAG` variable.
Move it to the different variable.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-06-14 15:08:31 +04:00
Noel Georgi
71a548d180
chore: generic boootloader implementation
This changes the bootloader code to be generic to support
multiple bootloader implementations.

Signed-off-by: Noel Georgi <git@frezbo.dev>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-06-13 23:36:20 +04:00
Andrey Smirnov
e9dbc9311b
test: bump versions for upgrade tests
As we're getting to 1.5.0, bump versions for upgrade tests.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-06-13 23:17:22 +04:00
Andrey Smirnov
0a99965efb
refactor: replace uncordonNode with controllers
Fixes #7233

Waiting for node readiness now happens in the `MachineStatus` controller
which won't mark the node as ready until Kubernetes `Node` is ready.

Handling cordoning/uncordining happens with help of additional resource
in `NodeApplyController`.

New controller provides reactive `NodeStatus` resource to see current
status of Kubernetes `Node`.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-06-13 21:48:42 +04:00
Andrey Smirnov
e858bca3a2
test: fix cilium integration tests
Due to a bug (?) cilium tests don't clean up all the deployments & pods,
leaving one pod in 'Pending' state.

Kubernetes e2e tests check for !Running pods in `kube-system` namespace.

Fix by moving cilium tests to a separate namespace.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-06-13 20:41:15 +04:00
Andrey Smirnov
455328d058
fix: allow time skew for generated kubeconfig
The `kubeconfig` can be fetched from one Talos node, while Kubernetes
API request might land on `kube-apiserver` on a different node which
might have time slightly out of sync.

The minimum time diff between the two might lead to `Unauthorized` error
on first use:

```
1 authentication.go:70] "Unable to authenticate the request" err="[x509: certificate has expired or is not yet valid: current time 2023-06-13T15:30:51Z is before 2023-06-13T15:30:52Z, verifying certificate SN=314179687645609956480346926163236202072, SKID=, AKID=E9:9E:A8:1E:0B:6C:8B:AB:1B:2B:7E:17:14:CF:A4:0A:82:6B:42:67 failed: x509: certificate has expired or is not yet valid: current time 2023-06-13T15:30:51Z is before 2023-06-13T15:30:52Z]"
```

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-06-13 20:03:38 +04:00
Noel Georgi
3ae05648ae
fix: usage of custom kernels
This fixes usage of custom kernel images to copy over the modules info
list and the default set of modules shipped with Talos.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-06-12 22:14:36 +05:30
Andrey Smirnov
0797b0d168
chore: add a pipeline to test cloud-images step without a release
Also uncomment Azure uploader.

Add the Azure environment variables to the Makefile cloud-images step.

Change disk size and tier to 16GiB and tier: P3

Add boolean value to drone pipeline and the cloud images hack will check the value to determine which Azure Compute Gallery to push images to.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Signed-off-by: Christian Rolland <christian.rolland@siderolabs.com>
2023-06-12 19:42:27 +04:00
Utku Ozdemir
e5a36268b6
docs: include allowSchedulingOnControlPlanes on talosctl gen config output
Include a description and a commented-out example for the `cluster.allowSchedulingOnControlPlanes` field on `talosctl gen config ...` output.

Closes siderolabs/talos#7313.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2023-06-12 16:38:24 +02:00
Dmitriy Matrenichev
c74d937280
chore: bump github.com/cosi-project/runtime
Replace resource.Resource with meta.ResourceWithRD where possible.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-06-12 09:49:08 -04:00
Andrey Smirnov
dbaf5c6997
refactor: task labelControlPlane into controllers
See #7233

The controlplane label is simply injected into existing controller-based
node label flow.

For controlplane taint default NoScheduleTaint, additional controller &
resource was implemented to handle node taints.

This also fixes a problem with `allowSchedulingOnControlPlanes` not
being reactive to config changes - now it is.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-06-12 15:25:13 +04:00
Nanfei Chen
1865a0c296
chore: modify some usages that are not recommended
Struct MetaValue has methods on both value and pointer receivers. Such usage is not recommended by the Go Documentation. Modifies the receiver usage.

Variable config collides with imported package name. Renames the variable config.

Removes a redundant alias.

Empty slice declaration uses a literal. Replaces with nil slice declaration.

Signed-off-by: Nanfei Chen <chennanfei@yeah.net>
Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-06-11 17:40:26 -04:00
Dmitriy Matrenichev
3816318b9e
chore: wrap config.Provider in atomic wrapper
Because `SetConfig` can be called concurrently with `Config` there is risk of data race, if something goes wrong. Since `config.Provider` is an interface type, it means its size is two machine words. And so in very unpleasant situations it can lead to arbitrary RCE, because interface variable can be in partially updated state.

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-06-09 15:05:39 -04:00
Nanfei Chen
d04cf19788
chore: clean up unnecessary self assignment
There is no need that the value of err is assigned to itself. So removes this self assignment.

Signed-off-by: Nanfei Chen <chennanfei@yeah.net>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-06-09 16:51:58 +04:00
Noel Georgi
a34a948985
fix: copy missing modules.* files
Copy missing `modules.order`, `modules.builtin` and
`modules.builtin.modinfo` files so tools can read them.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-06-09 17:25:28 +05:30
Andrey Smirnov
f5e3272fce
refactor: task 'updateBootLoader' as controller
Fixes #7232

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-06-09 15:27:48 +04:00
Andrey Smirnov
e7be6ee7c3
refactor: make event log streaming fully reactive
I ended up completely rewriting the controller, simplifying the flow
(somewhat) so that there's just a single control flow in the controller,
while reading from v1alpha1 events is converted to reading from a
channel.

Fixes #7227

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-06-08 23:13:33 +04:00
Noel Georgi
aef2192a65
chore: use fixed module list
Use a fixed list of modules to copy into Talos initramfs.

This makes sure we can still enable thing in Talos kernel as modules but
not ship it as default in Talos (extra modules could be extensions).

Also fixes: #7341

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-06-09 00:08:42 +05:30
Andrey Smirnov
c719aa2316
fix: allow http:// for discovery service URL
Fixes #7333

Also fixed the discovery service controller to reconnect the client on
config changes (previously it wasn't reactive on e.g. URL changes).

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-06-08 20:28:12 +04:00
Andrey Smirnov
39134d8d53
chore: fix cron pipeline
The step was added to integration pipeline, but I forgot to add it to
cron pipeline.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-06-08 17:34:19 +04:00
Andrey Smirnov
a61dcdbbd5
fix: don't load RDMA over Ethernet driver by default
Document the change for upgrade notes.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-06-08 16:28:32 +04:00
Andrey Smirnov
aac441f618
chore: update Go to 1.20.5, bump dependencies
Go dependencies, new pkgs, extras, etc.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-06-07 23:40:59 +04:00
Noel Georgi
1c0c7933df
chore: cleanup partition code
Cleanup partition code to be explicit about `Format` and `Partition`
options.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-06-08 00:35:09 +05:30
bdronneau
31b988281e
docs: add some words about certifcates
Certificates, CA, expiration.

Signed-off-by: bdronneau <basti1.dr@gmail.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-06-07 00:10:39 +04:00
Noel Georgi
e912c0dfcf
chore: use go-blockdevice for zeroing partitions
Use the `go-blockdevice` library to zero partitions.

Also added a test that writes `ones` to the partition and verifies its
zeroes after zeroing it.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-06-07 01:12:11 +05:30
Christian Rolland
e6dde8ffc5
feat: add network chaos to qemu development environment
Add flags for configuring the qemu bridge interface with chaos options:
- network-chaos-enabled
- network-jitter
- network-latency
- network-packet-loss
- network-packet-reorder
- network-packet-corrupt
- network-bandwidth

These flags are used in /pkg/provision/providers/vm/network.go at the end of the CreateNetwork function to first see if the network-chaos-enabled flag is set, and then check if bandwidth is set.  This will allow developers to simulate clusters having a degraded WAN connection in the development environment and testing pipelines.

If bandwidth is not set, it will then enable the other options.
- Note that if bandwidth is set, the other options such as jitter, latency, packet loss, reordering and corruption will not be used.  This is for two reasons:
	- Restriction the bandwidth can often intoduce many of the other issues being set by the other options.
	- Setting the bandwidth uses a separate queuing discipline (Token Bucket Filter) from the other options (Network Emulator) and requires a much more complex configuration using a Heirarchial Token Bucket Filter which cannot be configured at a granular enough level using the vishvananda/netlink library.

Adding both queuing disciplines to the same interface may be an option to look into in the future, but would take more extensive testing and control over many more variables which I believe is out of the scope of this PR.  It is also possible to add custom profiles, but will also take more research to develop common scenarios which combine different options in a realistic manner.

Signed-off-by: Christian Rolland <christian.rolland@siderolabs.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-06-06 20:15:26 +04:00
Noel Georgi
47986cb79e
chore: unify kexec phase
This changes the mounting/unmounting of `BOOT` partiton code into
`kexecPrepare` phase. Also skips if `BOOT` partition cannot be found.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-06-06 20:30:59 +05:30
Noel Georgi
3a865370f5
feat: qemu secureboot
Add qemu support for secureboot testing via `talosctl cluster create`.

Can be tested via:

```bash
sudo -E _out/talosctl-linux-amd64 cluster create --provisioner=qemu $REGISTRY_MIRROR_FLAGS --controlplanes=1 --workers=1 --iso-path=_out/talos-uki-amd64.iso --with-secureboot=true --with-tpm2=true --skip-injecting-config --with-apply-config
```

This currently only supports just booting Talos in SecureBoot mode.
Installation and Upgrade comes as extra PRs.

Fixes: #7324

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-06-06 19:20:07 +05:30
Andrey Smirnov
5dab45e869
refactor: allow kmsg log streaming to be reconfigured on the fly
Fixes #7226

This follows same flow as other similar changes - split out logging
configuration as a separate resource, source it for now in the cmdline.

Rewrite the controller to allow multiple log outputs, add send retries.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-06-06 15:56:24 +04:00
Dmitriy Matrenichev
8a02ecd4cb
chore: add endpoints balancer controller
This PR adds support for creating a list of API endpoints (each is pair of host and port).

It gets them from
- Machine config cluster endpoint.
- Localhost with LocalAPIServerPort if machine is control panel.
- netip.Addr[0] and port from affiliates if they are control panels.

For #7191

Signed-off-by: Dmitriy Matrenichev <dmitry.matrenichev@siderolabs.com>
2023-06-05 20:47:52 -04:00
Noel Georgi
423a31ac9d
chore: deprectae bootloader installer option
Deprecate the `bootloader` installer option. This has not been used in a
long while.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-06-05 23:21:03 +05:30
Andrey Smirnov
cdfece7d64
chore: optimize image compression
Use `pigz` (parallel) instead of `gzip`.

Use `xz` compression `-0` instead of `-6`.

This has pros and cons:

* image size goes up (77M -> 79M) (+2.5%)
* image generation goes down 19s -> 10s (-50%).

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-06-05 18:55:56 +04:00
Noel Georgi
bfc3419376
chore: add default console args
Add default console args for UKI iso.

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-06-05 19:43:20 +05:30
Andrey Smirnov
2749aeeda0
feat: add support for multi-doc strategic merge patching
Other changes:

* renamed `version:` to `apiVersion:` in the multi-doc format, as this
  better matches Kubernetes objects
* introduced (not used at the moment) a concept of `NamedDocuments`
  (many documents for the same type)
* added container validation on not having duplicate documents
* JSON6902 now denies multi-doc config

Fixes #7312

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-06-05 14:31:48 +04:00
Noel Georgi
3f68485e44
feat: add uki iso generation
This adds code to generate a UKI ISO (UEFI only).

Fixes: #7261

Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-06-02 22:44:27 +05:30
Andrey Smirnov
bab484a405
feat: use stable network interface names
Use `udevd` rules to create stable interface names.

Link controllers should wait for `udevd` to settle down, otherwise link
rename will fail (interface should not be UP).

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-06-01 21:29:12 +04:00
Utku Ozdemir
196dfb99b0
fix: do not probe kernel args in dashboard if not needed
If the dashboard is run without the "Config URL" screen, do not initialize it, and do not probe the kernel args for the code parameter.

Refactor the dashboard to do not construct the unused screens at all.

Closes siderolabs/talos#7300.

Signed-off-by: Utku Ozdemir <utku.ozdemir@siderolabs.com>
2023-06-01 10:32:43 +02:00
Andrey Smirnov
8c071b5796
fix: skip DHCP RENEW if server IP in the lease is all zeroes
RENEW packets are sent unicast, so Talos needs the address of the DHCP
server to send RENEW packets to.

Fixes #7211

Fixes #7263

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-31 22:51:05 +04:00
Andrey Smirnov
badbc51e63
refactor: rewrite code to include preliminary support for multi-doc
`config.Container` implements a multi-doc container which implements
both `Container` interface (encoding, validation, etc.), and `Conifg`
interface (accessing parts of the config).

Refactor `generate` and `bundle` packages to support multi-doc, and
provide backwards compatibility.

Implement a first (mostly example) machine config document for
SideroLink API URL.

Many places don't properly support multi-doc yet (e.g. config patches).

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-31 18:38:05 +04:00
Alex Lubbock
ecce29dee9
fix: upgrade-k8s use internal IP first, external IP fallback
Currently, upgrade-k8s adds both node internal and external IPs.
This commit uses the internal IP if available; external IP is
only used as a fallback.

Signed-off-by: Alex Lubbock <code@alexlubbock.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-31 18:21:27 +04:00
Andrey Smirnov
3c64a5ffba
chore: optimize image generation time
Use `pigz` and `--sparse` to handle more efficiently compression of the
assets.

Also move tasks out of `setup-ci` step, as it runs always, including for
the promoted pipelines.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-31 17:47:23 +04:00
Eirik Askheim
2292f36d97
chore: registry.k8s.io for coredns image
Replace docker.io with registry.k8s.io for the coredns image.

Signed-off-by: Eirik Askheim <eirik@x13.no>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-31 17:06:13 +04:00
Alex Corcoles
f2b258b373
docs: document talosctl version for upgrades
Use same version as running cluster.

Signed-off-by: Alex Corcoles <alex@corcoles.net>
Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-31 16:29:53 +04:00
Andrey Smirnov
a0773f783c
chore: add ukify Go script
This is a port of ukify.py and systemd-measure from systemd.

This requires no actual TPM to be present to calculate the PCR
signatures.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
Signed-off-by: Noel Georgi <git@frezbo.dev>
2023-05-30 23:33:26 +05:30
Andrey Smirnov
b69e38d1ff
chore: bump dependencies
New pkgs, Linux 6.1.30, Flannel 0.22.0, Go modules.

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
2023-05-29 23:19:44 +04:00