1670 Commits

Author SHA1 Message Date
Andrew Rynhard
568e398578 release(v0.6.0-alpha.6): prepare release
This is the official v0.6.0-alpha.6 release.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-07-27 15:19:50 -07:00
Andrew Rynhard
fbf3fd5304 chore: set default CIDRs
This reverts some temporary changes we made to get around some CI
issues.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-07-27 14:24:08 -07:00
Andrew Rynhard
1fdbcba763 fix: log interface on validation error
Print the failing interface name if addressing check errors.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-07-27 14:22:18 -07:00
Andrey Smirnov
564111d9d5 chore: use outer docker as buildkit instance
This should provide caching for the builds.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-27 13:55:29 -07:00
Andrey Smirnov
ca1972e708 fix: skip removing CRI state when doing upgrade with preserve
For single-node upgrades if we drop the CRI state, API server won't
start after reboot and this breaks the control plane.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-27 13:19:35 -07:00
Andrey Smirnov
3b42f56f43 fix: skip vmware platform for !amd64
This fixes the build on arm64.

The fix itself is part of PR #2156.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-27 12:54:04 -07:00
Andrey Smirnov
b5b70ec858 chore: upgrade pkgs and tools for Go 1.14.6
This also brings in multi-arch pkgs and tools, but we're not consuming
arm64 images yet.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-27 12:33:53 -07:00
Andrew Rynhard
1f31d24e55 chore: use Kubernetes pipelines
This moves to using Kubernetes pipelines.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-07-27 12:09:53 -07:00
Andrey Smirnov
c85608b8d9 test: add an option to bind docker to specific host IP
This allows overriding the default `0.0.0.0` (`*`) with a specific IP to
avoid conflicts.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-27 21:13:28 +03:00
Andrey Smirnov
13c0052a6c test: fix racy test ReaderNoFollow
Due to the race between `Read()` and context cancellation, an error might
be returned, which we can safely ignore.
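
A minimal sketch of how such an error could be tolerated, assuming the racy error wraps `context.Canceled`. The helper `ignoreCanceled` and both sample errors are hypothetical and are not taken from the Talos test code:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Illustrative errors for the demo below; neither comes from the Talos code base.
var (
	errWrappedCancel = fmt.Errorf("read: %w", context.Canceled)
	errDiskFailure   = errors.New("disk failure")
)

// ignoreCanceled (hypothetical helper) drops errors caused by context
// cancellation and passes every other error through unchanged.
func ignoreCanceled(err error) error {
	if errors.Is(err, context.Canceled) {
		return nil
	}

	return err
}

func main() {
	fmt.Println(ignoreCanceled(errWrappedCancel)) // cancellation is swallowed
	fmt.Println(ignoreCanceled(errDiskFailure))   // real failures still surface
}
```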

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-27 21:13:11 +03:00
Andrey Smirnov
3d8418a689 feat: force nodes to be set in talosctl commands using the API
With load-balancing enabled by default, running `talosctl` without
`--nodes` is risky, as the request might hit any control plane node.

Only two commands do not enforce this check, as they do their own node
contexts: `crashdump` and `health` (client-side).
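
The check described above could look roughly like the sketch below. The function `checkNodes`, the `exempt` map and the error text are hypothetical illustrations, not the actual `talosctl` implementation:

```go
package main

import (
	"errors"
	"fmt"
)

// errNodesRequired is an illustrative error, not the real talosctl message.
var errNodesRequired = errors.New("nodes are not set, please pass --nodes")

// exempt lists the two commands the commit message names as handling
// their own node contexts.
var exempt = map[string]bool{
	"crashdump": true,
	"health":    true,
}

// checkNodes (hypothetical) rejects API commands that were invoked
// without an explicit node list, unless the command is exempt.
func checkNodes(command string, nodes []string) error {
	if exempt[command] || len(nodes) > 0 {
		return nil
	}

	return fmt.Errorf("%s: %w", command, errNodesRequired)
}

func main() {
	fmt.Println(checkNodes("logs", nil))
	fmt.Println(checkNodes("logs", []string{"10.5.0.2"}))
	fmt.Println(checkNodes("health", nil))
}
```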

Integration tests were updated to always supply the `--nodes` CLI
argument. While doing that, I refactored the storage for discovered nodes
to use the existing `cluster.Info` interface.

The downside is that with e2e CAPI tests, CLI tests will be mostly
skipped, as we don't support discovery in CLI tests at the moment. This
can be fixed by using `talosctl kubeconfig` + `kubectl get nodes` for
node discovery.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-21 12:17:43 -07:00
Andrey Smirnov
f23c9111d1 feat: upgrade etcd to 3.3.22 version
Latest version in 3.3 branch is 3.3.23, but it's broken, so we use previous
stable version.

Switch to official etcd gcr.io registry, early support for arm64.

Move `etcd` service to run in system containerd.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-21 09:44:43 -07:00
Andrey Smirnov
70a65cbb01 feat: make partitions on additional disk without size occupy full disk
Fixes #2214

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-21 07:33:07 -07:00
Andrey Smirnov
3934c78c94 fix: log messages properly when sequence/phase/task fails
Previously the log said `done` for failed items, which was confusing
when combined with the error message later on.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-21 06:43:03 -07:00
Andrey Smirnov
74413b1393 fix: ignore sequence lock errors in machined
This prevents reboots when some action triggers a sequence while another
sequence is still running.
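
As a rough illustration of the idea, the sketch below rejects a second sequence while one is running and treats that rejection as non-fatal. The names `controller`, `errLocked` and `fatal` are assumptions made for this example and do not mirror the `machined` code:

```go
package main

import (
	"errors"
	"fmt"
)

// errLocked is an illustrative sentinel for "another sequence is running".
var errLocked = errors.New("sequence already running")

// controller (hypothetical) allows only one sequence at a time.
type controller struct {
	sem chan struct{}
}

func newController() *controller {
	return &controller{sem: make(chan struct{}, 1)}
}

// run executes seq if no other sequence holds the lock; otherwise it
// returns errLocked immediately instead of blocking.
func (c *controller) run(seq func() error) error {
	select {
	case c.sem <- struct{}{}:
		defer func() { <-c.sem }()

		return seq()
	default:
		return errLocked
	}
}

// fatal reports whether an error should lead to drastic handling (such as
// a reboot); lock contention is deliberately not treated as fatal.
func fatal(err error) bool {
	return err != nil && !errors.Is(err, errLocked)
}

func main() {
	c := newController()

	// The inner call arrives while the outer sequence is still running.
	err := c.run(func() error {
		return c.run(func() error { return nil })
	})

	fmt.Println(err, fatal(err))
}
```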

Fixes #2209

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-20 14:36:06 -07:00
dependabot[bot]
0aae950518 chore: bump lodash from 4.17.15 to 4.17.19 in /docs/website
Bumps [lodash](https://github.com/lodash/lodash) from 4.17.15 to 4.17.19.
- [Release notes](https://github.com/lodash/lodash/releases)
- [Commits](https://github.com/lodash/lodash/compare/4.17.15...4.17.19)

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-20 11:17:18 -07:00
Artem Chernyshev
c70c08c8ce chore: extract loadbalancer, network, crashdump and process from firecracker
Second part of refactoring to split common logic for VM provisioners
from Firecracker provisioner.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-07-20 11:03:03 -07:00
Andrey Smirnov
4cd6e7e200 refactor: use humanize.Bytes everywhere
This removes dependency on `bytefmt` package.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-20 07:26:33 -07:00
Andrey Smirnov
f047c42ae7 test: provide correct installer kernel args for firecracker
Firecracker never executes the bootloader, so kernel args passed to the
installer aren't used, but if the same disk image is used to boot Talos
e.g. in `qemu`, it fails to set up console properly for example.

This PR simply provides those kernel args to the installer so that
they're persisted in the image.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-20 16:52:08 +03:00
Artem Chernyshev
19cd46459b chore: initial extraction of base vm provisioner
Created base provisioner struct for all VM based provisioners.
Moved state.go and reflect.go to the common module.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-07-18 15:45:54 -07:00
steverfrancis
8dd81b0693 docs: use latest talosctl download link
Update download example to reference latest release.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-07-18 14:45:52 -07:00
Artem Chernyshev
3d25ceb13e chore: move inmemhttp from firecracker provisioner to internal/pkg/
To be reused in qemu provisioner later on.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-07-18 07:11:50 -07:00
Andrey Smirnov
ad99cb6421 feat: implement talosctl dashboard command
This builds a simple CLI UI for Talos cluster monitoring.

Some new APIs were added for monitoring based on Prometheus procfs
package.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-16 14:24:04 -07:00
Andrey Smirnov
56467af6a5 fix: wrap errors in upgrade API handler
We often see in the logs errors like:

```
machined Unknown [/machine.MachineService/Upgrade] 6.984959636s unary Unauthorized (:authority=unix:/run/system/machined/machine.sock;content-type=application/grpc;proxyfrom=172.21.0.2,172.21.0.3,172.21.0.4;user-agent=grpc-go/1.26.0)
```

```
* 172.21.0.4: rpc error: code = Unknown desc = Unauthorized
```

These errors are not related to the API handling, but most probably
coming from one of the actions performed by the handler. Wrap the errors
to get better debugging output.
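
A minimal sketch of the wrapping technique using `%w`. The step `pullInstaller` and the image name are hypothetical; the commit message does not say which action actually produces the `Unauthorized` error:

```go
package main

import (
	"errors"
	"fmt"
)

// errUnauthorized imitates the opaque error seen in the logs above.
var errUnauthorized = errors.New("Unauthorized")

// pullInstaller is a hypothetical action performed by the handler.
func pullInstaller(image string) error {
	return errUnauthorized
}

// upgrade wraps the error with the failing step, so the log shows where
// the error came from while errors.Is still matches the original cause.
func upgrade(image string) error {
	if err := pullInstaller(image); err != nil {
		return fmt.Errorf("error pulling installer image %q: %w", image, err)
	}

	return nil
}

func main() {
	err := upgrade("registry.example/installer:latest")

	fmt.Println(err)
	fmt.Println(errors.Is(err, errUnauthorized))
}
```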

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-16 13:17:40 -07:00
Andrey Smirnov
2f4fb34baf fix: update container name in docker crashdump
Small bug resulted in container names being cut in the wrong way.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-16 12:49:29 -07:00
Andrey Smirnov
1a0e1bc393 chore: update module dependencies
Fixes #2316

Simply update dependencies we don't track on version level to be
compatible with Talos components (like etcd or k8s).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-16 12:00:50 -07:00
Andrey Smirnov
41d5f7859a chore: update golangci-lint to 1.28.3
Fixes #2272

`gofumpt` is now included in `golangci-lint`, but `gofumports` is not,
so we keep using it as a separate binary and keep its version in sync
with `golangci-lint`.

This contains fixes from:

* `gofumpt` (automated, mostly around octal constants)
* `exhaustive` in `switch` statements
* `noctx` (adding context with default timeout to http requests)
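
For illustration, the three kinds of fixes listed above might look like the sketch below. All identifiers (`configMode`, `phase`, `newRequest`) are made up for this example and do not come from the Talos code base:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"os"
	"time"
)

// gofumpt rewrites legacy octal literals such as 0644 to the 0o form.
const configMode os.FileMode = 0o644

type phase int

const (
	phaseBoot phase = iota
	phaseRunning
	phaseShutdown
)

// describe names every phase explicitly, which is what the `exhaustive`
// linter checks for in switch statements over enum-like types.
func describe(p phase) string {
	switch p {
	case phaseBoot:
		return "boot"
	case phaseRunning:
		return "running"
	case phaseShutdown:
		return "shutdown"
	}

	return "unknown"
}

// newRequest builds a request bound to a context with a timeout, the
// pattern `noctx` asks for. The caller must invoke the returned cancel
// function once the request is finished.
func newRequest(url string, timeout time.Duration) (*http.Request, context.CancelFunc, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		cancel()

		return nil, nil, err
	}

	return req, cancel, nil
}

func main() {
	req, cancel, err := newRequest("http://example.invalid/healthz", 5*time.Second)
	if err != nil {
		fmt.Println("error:", err)

		return
	}
	defer cancel()

	_, hasDeadline := req.Context().Deadline()
	fmt.Println(configMode, describe(phaseRunning), req.Method, hasDeadline)
}
```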

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-16 08:05:42 -07:00
Andrey Smirnov
e82895ccc5 chore: upgrade Go to 1.14.5
go1.14.5 (released 2020/07/14) includes security fixes to the
crypto/x509 and net/http packages.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-16 07:05:54 -07:00
Spencer Smith
f290f88160 chore: update clusterctl for CI testing
This PR brings in the latest version of clusterctl that has built-in
support for the talos repos. I'll be chasing this with a move to using
the control-plane provider as well!

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-07-15 19:33:59 -04:00
Andrey Smirnov
c54639e541 feat: implement server-side API for cluster health checks
This implements existing server-side health checks as defined in
`internal/pkg/cluster/checks` in Talos API.

Summary of changes:

* new `cluster` API

* `apid` now listens without auth on local file socket

* `cluster` API is for now implemented in `machined`, but we can move it
to the new service if we find it more appropriate

* `talosctl health` by default now does server-side health check

UX: `talosctl health` without arguments does health check for the
cluster if it has healthy K8s to return master/worker nodes. If needed,
node list can be overridden with flags.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-15 13:52:13 -07:00
Spencer Smith
7d10677ee8 docs: update worker creation flags for azure docs
This PR updates the worker flags for azure. Fixes an issue where, if you
have multiple subnets and the talos one isn't default, the workers and
control plane nodes came up on different subnets. Requires updating the
firewalls if they don't come up in the same subnet, so this is better
UX.

Also added a note that azure support is broken in v0.5.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-07-15 12:03:33 -07:00
Andrew Rynhard
0617a10027 feat: upgrade Kubernetes to v1.19.0-rc.0
This brings in the latest version of Kubernetes.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-07-14 13:07:18 -07:00
Andrew Rynhard
fd62d457c3 release(v0.6.0-alpha.5): prepare release
This is the official v0.6.0-alpha.5 release.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-07-13 13:51:43 -07:00
Andrew Rynhard
d2fc210684 chore: update meeting links
This updates the office hours and contributors meetings links.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-07-13 12:50:17 -07:00
Andrey Smirnov
cbb7ca8390 refactor: merge osd into machined
This merges `osd` API into `machined`. API was copied from `osd` into
`machined`, and `osd` API was deprecated.

For backwards compatibility, `machined` still implements `osd` API, so
older Talos API clients can still talk to the node without changes.

Docs were updated. No functional changes.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-13 12:50:00 -07:00
Andrey Smirnov
19343e5f1a test: workaround famous flaky Containerd.RunTwice test
I'm not sure if this is going to stop it, but feels like we're hitting
some race condition in containerd itself, so attempt to sleep a bit in
between the container launches to avoid the errors.

```
=== RUN   TestContainerdSuite/TestRunTwice
2020/07/10 18:57:24 state Running: Started task 036de7a9-f667-48e8-905f-216233980f94 (PID 4964) for container 036de7a9-f667-48e8-905f-216233980f94
    TestContainerdSuite/TestRunTwice: containerd_test.go:179:
        	Error Trace:	containerd_test.go:179
        	Error:      	Received unexpected error:
        	            	failed to create task: "036de7a9-f667-48e8-905f-216233980f94": dial unix \00/containerd-shim/f88fff9fe9795db4229846b09b2da816f6bd981b8112345486ff3b5653892920.sock: connect: connection refused: unknown
        	Test:       	TestContainerdSuite/TestRunTwice
--- FAIL: TestContainerdSuite (6.62s)
    --- PASS: TestContainerdSuite/TestContainerCleanup (0.92s)
    --- PASS: TestContainerdSuite/TestImportFail (0.54s)
    --- PASS: TestContainerdSuite/TestImportSuccess (0.45s)
    --- PASS: TestContainerdSuite/TestRunLogs (0.26s)
    --- PASS: TestContainerdSuite/TestRunSuccess (0.31s)
    --- FAIL: TestContainerdSuite/TestRunTwice (0.39s)
    --- PASS: TestContainerdSuite/TestStopFailingAndRestarting (1.61s)
    --- PASS: TestContainerdSuite/TestStopSigKill (0.95s)
FAIL
```

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-10 13:33:18 -07:00
Andrey Smirnov
2144b6a099 feat: add names to tasks and phases
For each task, the name follows the function name for now (but it could
be customized if needed).

Phases are named after the main theme of the tasks inside the phase.
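
One way to derive a name from a function in Go is via `runtime.FuncForPC`, sketched below. This is an assumption about how such naming can be done, not a description of the mechanism this commit uses; `taskName`, `mountDisks` and `startNetwork` are hypothetical:

```go
package main

import (
	"fmt"
	"reflect"
	"runtime"
	"strings"
)

// Two hypothetical tasks.
func mountDisks() error   { return nil }
func startNetwork() error { return nil }

// taskName returns the bare function name of a task, with the package
// path stripped.
func taskName(task func() error) string {
	full := runtime.FuncForPC(reflect.ValueOf(task).Pointer()).Name()

	return full[strings.LastIndex(full, ".")+1:]
}

func main() {
	for _, task := range []func() error{mountDisks, startNetwork} {
		fmt.Println(taskName(task))
	}
}
```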

Task and phase events are now displayed in `talosctl events`.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-10 13:09:50 -07:00
Artem Chernyshev
8fc352ec4f feat: merge mode in talosctl kubeconfig
New flag `-m` will enable merge mechanism in `talosctl kubeconfig`

Command examples:

```
talosctl kubeconfig -m

talosctl kubeconfig -m ~/.kube/config
```

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-07-10 12:39:30 -07:00
Andrey Smirnov
bdd011be30 test: update events test with more flow control
Some of the event tests rely on publishing more events than buffer
capacity, so consumers should be able to keep up with the producer to
avoid hitting buffer overrun.

The producer was already rate-limited; this adds rate limiting for the
consumers with a higher burst. It should still allow consumers to reach
the blocked state, but synchronizes them a bit with each other.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-10 12:14:57 -07:00
Andrey Smirnov
9590030a84 feat: print crash dump in talosctl cluster create on failure
When cluster fails to be bootstrapped or it fails the health check, it's
hard to find the root cause without the logs.

This change adds optional crashdump (it dumps firecracker logs or docker
logs) after provisioning failure. It's not enabled by default.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-10 11:54:07 -07:00
Andrey Smirnov
60155c8048 test: update tests for pkg/follow to be less time-dependent
These tests rely a lot on `inotify` and interaction with the kernel. We
try to make them less dependent on wall time and performance by using
size hints and timeouts.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-10 21:00:43 +03:00
Andrey Smirnov
931237b23c test: update init node check in reset API tests
Previously we assumed that node 0 is the init node, and it can't be
reset. With new bootstrap API approach, there's no init node, and all
the nodes can be reset. This corrects the check to skip only the init
node, and with bootstrap API there's no init node (so no nodes are
skipped).

Fixes #2277

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-10 10:48:14 -07:00
Andrey Smirnov
804f162756 fix: improve node uncordon tasks
1. Increase retry timeout.

2. Use timeout per attempt.

3. Check for node readiness as a gate to succeed.
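
The three points above can be sketched as a retry loop with an overall deadline, a separate timeout per attempt, and success gated on readiness. The helper `retry`, the `fakeNode` type and all durations are assumptions made for illustration, not the values or code used in Talos:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var errNotReady = errors.New("node is not ready yet")

// fakeNode is a stand-in for a Kubernetes node; it becomes ready on the
// attemptsUntilReady-th attempt.
type fakeNode struct {
	attemptsUntilReady int
	attempts           int
	schedulable        bool
}

// uncordon succeeds only once the node is ready (the readiness gate).
func (n *fakeNode) uncordon(ctx context.Context) error {
	if err := ctx.Err(); err != nil {
		return err
	}

	n.attempts++
	if n.attempts < n.attemptsUntilReady {
		return errNotReady
	}

	n.schedulable = true

	return nil
}

// retry calls fn until it succeeds or the overall deadline expires.
// Every attempt runs with its own, shorter timeout.
func retry(ctx context.Context, total, perAttempt, interval time.Duration, fn func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, total)
	defer cancel()

	for {
		attemptCtx, attemptCancel := context.WithTimeout(ctx, perAttempt)
		err := fn(attemptCtx)
		attemptCancel()

		if err == nil {
			return nil
		}

		select {
		case <-ctx.Done():
			return fmt.Errorf("giving up: %w (last error: %v)", ctx.Err(), err)
		case <-time.After(interval):
		}
	}
}

// uncordonWithRetry wires the pieces together with illustrative durations.
func uncordonWithRetry(n *fakeNode) error {
	return retry(context.Background(), time.Second, 100*time.Millisecond, time.Millisecond, n.uncordon)
}

func main() {
	n := &fakeNode{attemptsUntilReady: 3}

	fmt.Println(uncordonWithRetry(n), n.attempts, n.schedulable)
}
```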

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-10 09:26:47 -07:00
Andrey Smirnov
a4a2a3c83a feat: uncordon nodes automatically on boot
Talos will mark node as schedulable if it was previously cordoned by
Talos (for upgrade, reset, etc.)

If user marked node as not schedulable, Talos won't change it on boot.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-09 15:32:36 -07:00
Andrey Smirnov
97d18b1c43 test: fix cli tests after load-balancing got enabled
There were three problems:

* cli tests did commands in sequence assuming they all hit the same
node, but with load-balancing it's no longer true

* restart test was affected, as it hit different node for check after
restart, and it succeeded immediately, while on original node process
was still starting which resulted in failure in the next tests; replace
the check to make sure service is up and healthy, so that test leaves
cluster in a good state

* restart API response had wrong format (no message returned) which
resulted in failures with apid proxy (when used with `-n`)

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-09 14:06:30 -07:00
Andrey Smirnov
50db9b6073 docs: update firecracker for new home of tc-redirect-tap plugin
See https://github.com/firecracker-microvm/firecracker-go-sdk/issues/174#issuecomment-655798205

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-09 11:47:28 -07:00
Andrey Smirnov
4f5660b22b test: fix sonobuoy delete
It expects kubeconfig as required argument.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-09 18:46:57 +03:00
Andrey Smirnov
5ecddf2866 feat: add round-robin LB policy to Talos client by default
Handling of multiple endpoints has already been implemented in #2094.

This PR enables the round-robin policy so that grpc picks a new endpoint
for each call (instead of sending every request to the first control
plane node).

Endpoint list is randomized to handle cases when only one request is
going to be sent, so that it doesn't go always to the first node in the
list.
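
The combination of a shuffled endpoint list and round-robin selection can be illustrated with the standalone sketch below. It only models the idea; it is not the gRPC balancer and not the Talos client code, and `picker`, `newPicker` and `newSeededPicker` are hypothetical names:

```go
package main

import (
	"fmt"
	"math/rand"
	"sync/atomic"
)

// picker hands out endpoints in round-robin order.
type picker struct {
	endpoints []string
	next      uint64
}

// newPicker copies and shuffles the endpoint list, so that a client that
// sends only one request does not always hit the first endpoint.
func newPicker(endpoints []string, rng *rand.Rand) *picker {
	shuffled := append([]string(nil), endpoints...)

	rng.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})

	return &picker{endpoints: shuffled}
}

// newSeededPicker is a convenience wrapper for reproducible demos.
func newSeededPicker(endpoints []string, seed int64) *picker {
	return newPicker(endpoints, rand.New(rand.NewSource(seed)))
}

// pick returns the next endpoint, or "" if the list is empty.
func (p *picker) pick() string {
	if len(p.endpoints) == 0 {
		return ""
	}

	n := atomic.AddUint64(&p.next, 1) - 1

	return p.endpoints[n%uint64(len(p.endpoints))]
}

func main() {
	p := newSeededPicker([]string{"10.5.0.2", "10.5.0.3", "10.5.0.4"}, 42)

	for i := 0; i < 4; i++ {
		fmt.Println(p.pick())
	}
}
```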

grpc handles dead/unresponsive nodes automatically for us.

`talosctl cluster create` and provision tests switched to use
client-side load balancer for Talos API.

Additional improvements:

* `talosctl` now reports correct node IP when using commands without
`-n`, not the loadbalancer IP (if using multiple endpoints of course)

* loadbalancer can't provide reliable handling of errors when upstream
server is unresponsive or there're no upstreams available, grpc returns
much more helpful errors

Fixes #1641

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-09 08:35:15 -07:00
Andrey Smirnov
4cc074cdba feat: implement API access to event history
1. Add [xid-based](https://github.com/rs/xid) event IDs. Xids
are sortable and unique enough. Xids also encode the event publishing
time with one-second precision.

2. Add three ways to look back into event history: based on number of
events, on time and ID. Lookup via ID might be used to restart event
polling in case of broken API connection from the same moment.

3. Reimplement the core event buffer with positions which are always
incremented instead of generation+index; this implementation is much
simpler (idea from circular buffer).

4. By default, Events API works the same - it shows no history and
starts streaming new events only.
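
A simplified sketch of a buffer with ever-increasing positions and a count-based look-back, as described in points 2 and 3. The type `eventBuffer` and its methods are hypothetical; the real implementation also deals with concurrency, streaming, and time- and ID-based lookups:

```go
package main

import "fmt"

// eventBuffer stores the most recent events in a fixed-size slice.
// writePos counts every event ever published and never wraps; the slot
// for a position is simply position modulo capacity.
type eventBuffer struct {
	items    []string
	writePos int64
}

// newEventBuffer expects a capacity greater than zero.
func newEventBuffer(capacity int) *eventBuffer {
	return &eventBuffer{items: make([]string, capacity)}
}

func (b *eventBuffer) publish(ev string) {
	b.items[b.writePos%int64(len(b.items))] = ev
	b.writePos++
}

// tail returns up to n of the most recent events, oldest first. Events
// that were already overwritten are not returned.
func (b *eventBuffer) tail(n int) []string {
	if n <= 0 {
		return nil
	}

	capacity := int64(len(b.items))

	start := b.writePos - int64(n)
	if oldest := b.writePos - capacity; start < oldest {
		start = oldest
	}

	if start < 0 {
		start = 0
	}

	out := make([]string, 0, b.writePos-start)
	for pos := start; pos < b.writePos; pos++ {
		out = append(out, b.items[pos%capacity])
	}

	return out
}

func main() {
	b := newEventBuffer(3)

	for i := 1; i <= 5; i++ {
		b.publish(fmt.Sprintf("event-%d", i))
	}

	fmt.Println(b.tail(2))
	fmt.Println(b.tail(10))
}
```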

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-08 10:54:50 -07:00
Andrey Smirnov
aa687cf8cd fix: update the control plane cluster health check
Include kube-apiserver in the list of daemon sets to be checked, and
for each daemon set verify number of pods running and ready, as when
control plane is damaged daemon set properties are not updated properly.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-08 17:53:21 +03:00