2334 Commits

Author SHA1 Message Date
Alexey Palazhchenko
37a5edf04a feat: update Kubernetes to 1.21.0 release
See CHANGELOG:
https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.21.md

Closes #3329.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-04-09 20:08:20 +03:00
Alexey Palazhchenko
30f687b417 fix: document HDMI problem on RPi 4
Closes #3414.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-04-08 14:06:12 -07:00
Alexey Palazhchenko
29da22d063 feat: add config validation warnings
Closes #3412.
Refs #3413.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-04-08 13:49:58 -07:00
Andrey Smirnov
eee7ad13aa release(v0.10.0-alpha.2): prepare release
This is the official v0.10.0-alpha.2 release.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-04-08 13:03:50 -07:00
Andrey Smirnov
e0650218a6 feat: support etcd recovery from snapshot on bootstrap
When a Talos `controlplane` node is waiting for bootstrap, `etcd`
contents can be recovered from a snapshot created with
`talosctl etcd snapshot` on a healthy cluster.

The bootstrap process works the same way as before, but the etcd data
directory is recovered from the snapshot.

This flow enables disaster recovery for the control plane: given that
periodic backups are available, destroy control plane nodes, re-create
them with the same config, and bootstrap one node with the saved
snapshot to recover etcd state at the time of the snapshot.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-04-08 10:15:37 -07:00
Artem Chernyshev
247bd50e05 docs: describe steps to install and boot Talos from the SSD on rockpi4
Describe that gross flow while I still remember it.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-04-07 13:06:58 -07:00
Spencer Smith
e6b4e524ff test: update CAPA to 0.6.4
This PR pulls in an updated cluster api aws version, ensuring the CRDs
are closer to what's expected when we patch the CAPA image later in the
setup. We will eventually move to 0.6.5 as soon as it's cut.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2021-04-07 14:37:20 -04:00
Andrey Smirnov
28753f6dcb fix: trim endpoints/nodes from arguments in talosctl config
When copy-pasting, extra whitespace might be added around an argument to
`talosctl config endpoints/nodes`, which breaks the config because the
endpoint no longer parses as an IP address.
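
A minimal sketch of the idea, not the actual `talosctl` code: `trimArgs` is a hypothetical helper and the addresses are made up. It shows that `net.ParseIP` rejects a value with surrounding whitespace and accepts it once trimmed.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// trimArgs is a hypothetical helper that strips surrounding whitespace
// (spaces, tabs, newlines) from every argument.
func trimArgs(args []string) []string {
	trimmed := make([]string, len(args))
	for i, arg := range args {
		trimmed[i] = strings.TrimSpace(arg)
	}
	return trimmed
}

func main() {
	// Sample addresses are invented for the illustration.
	raw := []string{" 10.5.0.2", "10.5.0.3 \n"}

	for i, endpoint := range trimArgs(raw) {
		fmt.Printf("raw parses: %t, trimmed parses: %t\n",
			net.ParseIP(raw[i]) != nil, net.ParseIP(endpoint) != nil)
	}
}
```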

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-04-07 11:37:02 -07:00
Alexey Palazhchenko
aca63b8829 docs: fix "DigitalOcean" spelling
Refs #3427.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-04-07 09:13:24 -07:00
Andrey Smirnov
33035901ff fix: revert mark PMBR EFI partition as bootable
See talos-systems/go-blockdevice#34 talos-systems/talos#3440

That change broke UEFI boot.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-04-07 07:24:58 -07:00
Andrey Smirnov
fbfd1eb2b1 refactor: pull new version of os-runtime, update code
This is mostly refactoring to adapt to the new APIs.

There are some small changes which are not user-visible immediately (but
visible when using `talosctl get` to inspect low-level details):

* `extras` namespace is removed, it was a hack to distinguish extra and
system manifests
* `Manifests` are managed by two controllers as shared outputs, stored
in the `controlplane` namespace now
* `talosctl inspect dependencies` output got slightly changed
* resources now have `md.owner` set to the controller name which manages
the resource

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-04-07 06:55:09 -07:00
Alexey Palazhchenko
8737ea716a feat: allow external cloud provider configuration
Closes #3312.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-04-06 22:54:24 -07:00
Andrey Smirnov
3909e2d011 chore: update Go to 1.16.3
See talos-systems/tools#134 talos-systems/pkgs#260
talos-systems/extras#16

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-04-06 13:53:53 -07:00
Andrey Smirnov
690eb20e97 chore: update blockdevice library for PMBR bootable fix
See https://github.com/talos-systems/go-blockdevice/pull/33

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-04-06 06:14:56 -07:00
Andrey Smirnov
a8761b8e1e fix: require leader on etcd member operations
It is not obvious whether this fix is actually needed, but the failures
I've seen in the tests seem to stem from the added member not being
visible in the member list fetched after the add command succeeds.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-04-06 05:36:45 -07:00
Alexey Palazhchenko
3dc84625cb fix: make both HDMI ports work on RPi 4
Closes #3414.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-04-05 15:25:39 -07:00
Andrey Smirnov
bd5ae1e0b5 fix: add a check for overlay mounts in installer pre-flight checks
Overlay mounts in `mountinfo` don't show up as mounts for any particular
block device, so the existing check doesn't catch them.

This was discovered because our current master can't upgrade due to the
overlay mount for `/opt` and the `apid` image in `/opt/apid` (which will be
fixed in a separate PR).

Without the check, the installer fails while resetting the partition table
for the disk, effectively wiping the node (`device or resource busy` error).
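
A rough sketch of how overlay mounts could be detected from `mountinfo`-formatted text, assuming the standard Linux layout where the filesystem type follows the `-` separator field. `overlayMountPoints` and the sample lines are hypothetical and are not taken from the installer.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// overlayMountPoints scans mountinfo-formatted text and returns the mount
// points whose filesystem type is "overlay".
func overlayMountPoints(mountinfo string) []string {
	var points []string

	scanner := bufio.NewScanner(strings.NewReader(mountinfo))
	for scanner.Scan() {
		// Optional fields end at a lone "-" field; the filesystem type follows it.
		parts := strings.SplitN(scanner.Text(), " - ", 2)
		if len(parts) != 2 {
			continue
		}

		pre, post := strings.Fields(parts[0]), strings.Fields(parts[1])
		if len(pre) < 5 || len(post) < 1 {
			continue
		}

		if post[0] == "overlay" {
			points = append(points, pre[4]) // the fifth field is the mount point
		}
	}

	return points
}

func main() {
	// Invented sample: one block-device mount and one overlay mount.
	sample := "22 1 8:6 / / rw,relatime shared:1 - xfs /dev/sda6 rw\n" +
		"45 22 0:40 / /opt rw,relatime shared:20 - overlay overlay rw,lowerdir=/lower,upperdir=/upper,workdir=/work\n"

	fmt.Println(overlayMountPoints(sample))
}
```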

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-04-05 14:29:46 -07:00
Andrey Smirnov
df8649cbe6 refactor: download modules before go generate
This moves things around a bit so that `go generate` is called after
modules are downloaded, as `go generate` downloads modules as well.
This fixes a race condition which might show up randomly.

Spotted by: @AlekSi

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-04-05 11:38:40 -07:00
Andrey Smirnov
39ae0415e9 chore: bump dependencies via dependabot
See #3431 #3432 #3433 #3434

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-04-05 06:16:24 -07:00
Artem Chernyshev
e16d6d3468 fix: publish rockpi4 image to release artifacts
Attempt #2. Forgot to add it to .drone.jsonnet also 🤦

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-04-03 18:20:54 -07:00
Artem Chernyshev
39c6dbcc7a feat: add --config-patch parameter to talosctl gen config
Fixes: https://github.com/talos-systems/talos/issues/3410

Works the same as in `talosctl cluster create`: applies an RFC 6902 JSON
patch during config generation if specified.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-04-02 10:56:41 -07:00
Andrey Smirnov
e664362cec feat: add API and command to save etcd snapshot (backup)
This adds a simple API and a `talosctl etcd snapshot` command to stream
a snapshot of etcd from one of the control plane nodes to a local file.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-04-02 09:20:16 -07:00
Andrey Smirnov
61b694b948 fix: create rootfs for system services via /system tmpfs
Container rootfs should be writeable, as containerd mounts standard
filesystems (`/proc` et al.).

When `/opt` was used as the root of the container filesystem, this caused a
problem: Talos overlay-mounts `/opt` on `/var/system`, which means that
as long as `apid` is running, `/var` can't be unmounted, which breaks
upgrades.

So instead use `/system/libexec` as the rootfs for the containers (`/system`
is `tmpfs`), and bind-mount the actual executable (`/sbin/init`, machined)
into the rootfs.

This fixes upgrades for 0.10.

See also #3425

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-04-02 06:37:29 -07:00
Andrey Smirnov
abc2e17ebb test: update 0.9.x version in upgrade tests to 0.9.1
Version 0.9.1 contains a fix for concurrent map write on unmount which
was frequently breaking our upgrade tests.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-04-02 03:59:36 -07:00
Andrey Smirnov
a1e6415403 fix: retry Kubernetes API errors on cordon/uncordon/etc
This extracts the function which was used in upgrade/convert flows to retry
transient errors into the main `kubernetes` package and expands it to ignore
timeout errors. It is now used to retry errors where applicable in
`pkg/kubernetes`.
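
A minimal sketch of such a retry helper using only the standard library. The names `isTransient`, `retryTransient`, and `flaky` are hypothetical, and treating EOF and network timeouts as the transient set is an assumption made for the illustration, not the classification used in `pkg/kubernetes`.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
	"time"
)

var errPermanent = errors.New("permanent failure")

// isTransient is a hypothetical classifier: EOF and network timeouts are
// considered worth retrying, everything else is not.
func isTransient(err error) bool {
	if errors.Is(err, io.EOF) || errors.Is(err, io.ErrUnexpectedEOF) {
		return true
	}

	var netErr net.Error

	return errors.As(err, &netErr) && netErr.Timeout()
}

// retryTransient calls fn up to attempts times, sleeping delay between tries.
// It returns immediately on success or on a non-transient error.
func retryTransient(attempts int, delay time.Duration, fn func() error) error {
	var err error

	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil || !isTransient(err) {
			return err
		}

		if i < attempts-1 {
			time.Sleep(delay)
		}
	}

	return err
}

// flaky returns a function that fails with io.EOF for its first n calls.
func flaky(n int) func() error {
	calls := 0

	return func() error {
		calls++
		if calls <= n {
			return io.EOF
		}

		return nil
	}
}

func main() {
	fmt.Println(retryTransient(3, 10*time.Millisecond, flaky(2)))
	fmt.Println(retryTransient(3, 10*time.Millisecond, func() error { return errPermanent }))
}
```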

Fixes #3403

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-04-02 03:51:40 -07:00
Andrey Smirnov
063d1abe9c fix: print task failure error immediately
The way processing works is that errors are not printed in the
sequencer; instead, whatever called the sequencer prints the error.
This means that for fatal failures, say in the 'upgrade' sequence, the error
message is printed by machined after `apid` is stopped.

As a result, the error isn't visible via `talosctl dmesg`, only on the
serial console.

This changes the flow to print the task error as soon as the task fails, and
removes 'done' messages in the sequencer if a sequence/phase/task fails
(otherwise it has both a 'done' and a 'failed' message, which is
confusing).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-04-02 03:08:26 -07:00
Andrey Smirnov
e039172eda fix: ignore EOF errors from Kubernetes API when converting control plane
During the conversion process, the API server goes down, so we can see lots
of network errors, including EOF.

Fixes #3404

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-04-01 10:52:44 -07:00
Branden Cash
7bcb91a433 docs: fix typo for stage flag
The docs mentioned a `--staged` flag, but it should be `--stage`.

Signed-off-by: Branden Cash <ammmze@gmail.com>
2021-04-01 10:44:46 -07:00
Andrey Smirnov
a43acb2150 feat: bring in Linux 5.10.27, support for 32-bit time syscalls
This provides binary compatibility for really old binaries using 32-bit
time.

See also: talos-systems/pkgs#259

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-04-01 08:21:37 -07:00
Andrey Smirnov
e2bb5973da release(v0.10.0-alpha.1): prepare release
This is the official v0.10.0-alpha.1 release.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-03-31 23:17:31 +03:00
Andrey Smirnov
8309312a3d chore: build components with race detector enabled in dev mode
This provides a variable to build core Talos components with race
detector enabled: `make initramfs WITH_RACE=yes`.

Also refactored and DRYed up the build code exposing common build/link
flags via the Makefile.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-03-31 10:55:50 -07:00
Andrey Smirnov
7d91258475 test: fix data race in apply config tests
Variable `chanErr` was read before waiting for the goroutine to finish.
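
The general shape of this kind of fix, sketched under assumptions: `runAndWait` is a hypothetical helper, not the test code itself. The error variable is read only after `wg.Wait()` returns, so the read cannot race with the write in the goroutine.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errSample = errors.New("sample failure")

// runAndWait starts work in a goroutine and reads its error only after the
// goroutine has finished.
func runAndWait(work func() error) error {
	var (
		wg      sync.WaitGroup
		workErr error
	)

	wg.Add(1)

	go func() {
		defer wg.Done()

		workErr = work()
	}()

	// Done happens before Wait returns, so the write above is visible here.
	wg.Wait()

	return workErr
}

func main() {
	fmt.Println(runAndWait(func() error { return nil }))
	fmt.Println(runAndWait(func() error { return errSample }))
}
```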

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-03-31 10:46:50 -07:00
Andrey Smirnov
204caf8eb9 test: fix apply-config integration test, bump clusterctl version
Tests for the ApplyConfig API were relying on the not really supported behavior
of modifying config via the `Provider` interface (and it was "fixed" in
another PR which cleans up such access to the configuration).

The clusterctl version was bumped to try to work around strange CAPI bootstrap
failures in e2e-capi.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-03-31 09:55:53 -07:00
Artem Chernyshev
d812099df3 fix: address several issues in TUI installer
- Table row selection was off by one element, so the disk selector wasn't
quite working.
- Reduce the number of interfaces on the last screen: show only the ones that
have physical addresses (changing some settings for lo0, for example, was
making the TUI generate incorrect configs).

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-03-30 19:00:12 -07:00
Andrey Smirnov
269c9ad098 fix: don't write to config object on access
This avoids a data race on config access: the config object might be accessed
concurrently, and it should be read-only on access.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-03-30 10:38:02 -07:00
Alexey Palazhchenko
a9451f5712 feat: update Kubernetes to 1.21.0-beta.1
See CHANGELOG:
https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.21.md

Refs #3329.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-03-30 03:07:03 -07:00
Artem Chernyshev
4b42ced4c2 feat: add ability to disable comments in talosctl gen config
Fixes: https://github.com/talos-systems/talos/issues/3384

Instead of adding a simple `--no-comments` flag, decided to use a more
granular approach which allows disabling either examples, or docstrings,
or both.

Thus the command looks like this:

```bash
talosctl gen config --with-docs=false --with-examples=false <...>
```

Both are enabled by default to provide better UX for users learning
Talos.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-03-29 10:52:14 -07:00
Andrey Smirnov
a0dcfc3d52 fix: workaround race in containerd runner with stdin pipe
The containerd API to pass stdin to the container is far from perfect,
and it seems to contain a race condition we can't avoid: if `NewTask()`
fails, it starts the I/O loop in a goroutine, but never stops it. We
can't stop it either, as `NewTask()` failed, so to work around this
failure, copy the stdin into a new reader on each access.

This copying shouldn't be a big deal for us, as it's just machine
configuration and it's tiny.
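
The "new reader on each access" idea can be sketched as follows. `stdinSource` and `drain` are hypothetical names, and this is not the runner's actual code: the payload is kept in memory and every caller gets an independent reader, so a reader leaked to an I/O loop that never stops cannot affect the next attempt.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// stdinSource keeps the payload in memory and hands out a fresh,
// independent reader on every call.
type stdinSource struct {
	data []byte
}

// Reader returns a new reader positioned at the start of the payload.
func (s *stdinSource) Reader() io.Reader {
	return bytes.NewReader(s.data)
}

// drain reads everything from r and returns it as a string.
func drain(r io.Reader) string {
	contents, err := io.ReadAll(r)
	if err != nil {
		return ""
	}

	return string(contents)
}

func main() {
	// The payload is a made-up stand-in for a machine configuration.
	src := &stdinSource{data: []byte("version: v1alpha1\n")}

	first := drain(src.Reader())
	second := drain(src.Reader()) // unaffected by the first, fully consumed reader

	fmt.Println(first == second)
}
```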

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-03-29 10:04:50 -07:00
Andrey Smirnov
2ea20f598a feat: replace timed with time sync controller
This is a complete rewrite of the time sync process.

Now the time sync process starts early at boot time, and it adapts to
configuration changes:

* before config is available, `pool.ntp.org` is used
* once config is available, configured time servers are used

The controller updates the same time sync resource that other controllers
depended on, so they have a chance to wait for the time sync event.

Talos services which depend on time now wait on the same resource instead of
waiting on timed health.

New features:

* time sync now sticks to a particular time server unless there's an
error from that server, in which case the server is changed; this
improves time sync accuracy

* time sync acts on config changes immediately, so it's possible to
reconfigure time sync at any time

* there's a new 'epoch' field in time sync resources which allows
time-dependent controllers to regenerate certs when there's a big enough
jump in time

Features to implement later:

* apid shouldn't depend on timed, it should be started early and it
should regenerate certs on time jump

* trustd should be updated in the same way
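
A toy sketch of the sticky-server and epoch behavior described above. Everything here is an assumption made for illustration: the `timeSync` type, the `query` callback returning an offset in seconds, the server names, and the jump threshold are all hypothetical and do not mirror the controller's real code.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

var errUnreachable = errors.New("time server unreachable")

// timeSync tracks which server is in use and an epoch counter that
// time-dependent consumers could watch.
type timeSync struct {
	servers []string
	current int
	epoch   int
}

// step performs one sync attempt against the current server.
// On error it switches to the next server; otherwise it stays put.
// A clock offset larger than jumpThreshold (seconds) bumps the epoch.
func (t *timeSync) step(query func(server string) (float64, error), jumpThreshold float64) {
	offset, err := query(t.servers[t.current])
	if err != nil {
		t.current = (t.current + 1) % len(t.servers)

		return
	}

	if math.Abs(offset) > jumpThreshold {
		t.epoch++
	}
}

func main() {
	ts := &timeSync{servers: []string{"ntp-a.example", "ntp-b.example"}}

	ts.step(func(string) (float64, error) { return 0.002, nil }, 15)        // small offset: no change
	ts.step(func(string) (float64, error) { return 0, errUnreachable }, 15) // error: switch server
	ts.step(func(string) (float64, error) { return 120, nil }, 15)          // big jump: new epoch

	fmt.Println(ts.servers[ts.current], ts.epoch)
}
```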

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-03-29 09:29:43 -07:00
Andrey Smirnov
c38a161ade test: add unit-test for machine config validation
Follow-up for #3383

I added a couple of first tests; we should add more as we go through this
code. Even with those tests, I found and fixed two more panics.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-03-29 07:32:06 -07:00
Andrey Smirnov
a6106815b7 chore: bump dependencies via dependabot
See #3386 #3387 #3388

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-03-29 06:38:55 -07:00
Alexey Palazhchenko
35598f391d chore: refactor: extract ClusterConfig
Extract ClusterConfig and related types.
Make one huge file a bit smaller.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-03-29 05:49:51 -07:00
Artem Chernyshev
032851844f fix: get rid of data race in encoder and fix concurrent map access
Fixes: https://github.com/talos-systems/talos/issues/3377, https://github.com/talos-systems/talos/issues/3380

Fixed the data race in the encoder documentation examples by using `sync.Once`.
We only need to generate them once anyway, and it's not a big deal
that we are using the same pointers everywhere, as they're pretty much
constant.

As for `system.go`, it looks like we actually have concurrent operations for
partition unmounts, so I just added a mutex there.
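
Both techniques in miniature, as a sketch under assumptions: `docExamples`, `partitions`, and the labels are hypothetical stand-ins rather than the real encoder or `system.go` code. `sync.Once` guards one-time generation of shared data, and a mutex serializes concurrent map updates.

```go
package main

import (
	"fmt"
	"sync"
)

var (
	examplesOnce sync.Once
	examples     []string
	buildCount   int
)

// docExamples builds the shared examples exactly once, no matter how many
// goroutines call it concurrently.
func docExamples() []string {
	examplesOnce.Do(func() {
		buildCount++
		examples = []string{"example-a", "example-b"}
	})

	return examples
}

// partitions guards its map with a mutex so concurrent unmounts are safe.
type partitions struct {
	mu      sync.Mutex
	mounted map[string]bool
}

func (p *partitions) unmount(label string) {
	p.mu.Lock()
	defer p.mu.Unlock()

	delete(p.mounted, label)
}

func main() {
	// Labels are illustrative only.
	p := &partitions{mounted: map[string]bool{"EPHEMERAL": true, "STATE": true}}

	var wg sync.WaitGroup

	for _, label := range []string{"EPHEMERAL", "STATE"} {
		wg.Add(1)

		go func(label string) {
			defer wg.Done()

			docExamples()
			p.unmount(label)
		}(label)
	}

	wg.Wait()

	fmt.Println(buildCount, len(p.mounted))
}
```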

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2021-03-29 01:00:46 -07:00
Andrey Smirnov
4b3580aa57 fix: prevent panic in validate config if machine.install is missing
Fixes #3382

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-03-26 15:47:07 -07:00
Alexey Palazhchenko
d7e9f6d6a8 chore: build integration tests with -race
Refs https://github.com/talos-systems/talos/issues/3378.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-03-26 10:08:12 -07:00
Alexey Palazhchenko
9f7d67ac71 chore: fix typo
Actually share golangci-lint cache.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-03-25 15:14:30 -07:00
Andrey Smirnov
672c970739 fix: allow convert-k8s --remove-initialized-keys when K8s cp is down
The `--remove-initialized-key` flag is the last resort to convert the
control plane when the control plane is down for whatever reason, so it
should work when the control plane is not available.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2021-03-25 14:06:08 -07:00
Alexey Palazhchenko
fb605a0fc5 chore: tweak nolintlint settings
Copy from kres manually for now.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-03-25 13:56:16 -07:00
Alexey Palazhchenko
1f5a0c4065 fix: resolve the issue with Kubernetes upgrade
Add missing cases, refactoring.

Signed-off-by: Alexey Palazhchenko <alexey.palazhchenko@gmail.com>
2021-03-25 12:48:28 -07:00
Spencer Smith
74b2b5578c docs: update AWS docs to ensure instances are tagged
This PR updates our AWS docs so that we specify a tag when creating
instances. This makes it easier to know which VMs were created as part
of this process, as well as quickly spot the init node.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2021-03-25 11:55:19 -04:00