1657 Commits

Author SHA1 Message Date
Andrey Smirnov
3934c78c94 fix: log messages properly when sequence/phase/task fails
Previously the log said `done` for failed items, which was confusing
when combined with the error message later on.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-21 06:43:03 -07:00
Andrey Smirnov
74413b1393 fix: ignore sequence lock errors in machined
This prevents reboots when some action triggers a sequence while another
sequence is still running.

Fixes #2209

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-20 14:36:06 -07:00
dependabot[bot]
0aae950518 chore: bump lodash from 4.17.15 to 4.17.19 in /docs/website
Bumps [lodash](https://github.com/lodash/lodash) from 4.17.15 to 4.17.19.
- [Release notes](https://github.com/lodash/lodash/releases)
- [Commits](https://github.com/lodash/lodash/compare/4.17.15...4.17.19)

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-20 11:17:18 -07:00
Artem Chernyshev
c70c08c8ce chore: extract loadbalancer, network, crashdump and process from firecracker
Second part of refactoring to split common logic for VM provisioners
from Firecracker provisioner.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-07-20 11:03:03 -07:00
Andrey Smirnov
4cd6e7e200 refactor: use humanize.Bytes everywhere
This removes dependency on `bytefmt` package.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-20 07:26:33 -07:00
Andrey Smirnov
f047c42ae7 test: provide correct installer kernel args for firecracker
Firecracker never executes the bootloader, so kernel args passed to the
installer aren't used, but if the same disk image is used to boot Talos
e.g. in `qemu`, it fails to set up console properly for example.

This PR simply provides those kernel args to the installer so that
they're persisted in the image.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-20 16:52:08 +03:00
Artem Chernyshev
19cd46459b chore: initial extraction of base vm provisioner
Created base provisioner struct for all VM based provisioners.
Moved state.go and reflect.go to the common module.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-07-18 15:45:54 -07:00
steverfrancis
8dd81b0693 docs: use latest talosctl download link
Update download example to reference latest release.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-07-18 14:45:52 -07:00
Artem Chernyshev
3d25ceb13e chore: move inmemhttp from firecracker provisioner to internal/pkg/
To be reused in qemu provisioner later on.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-07-18 07:11:50 -07:00
Andrey Smirnov
ad99cb6421 feat: implement talosctl dashboard command
This builds a simple CLI UI for Talos cluster monitoring.

Some new APIs were added for monitoring based on Prometheus procfs
package.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-16 14:24:04 -07:00
Andrey Smirnov
56467af6a5 fix: wrap errors in upgrade API handler
We often see in the logs errors like:

```
machined Unknown [/machine.MachineService/Upgrade] 6.984959636s unary Unauthorized (:authority=unix:/run/system/machined/machine.sock;content-type=application/grpc;proxyfrom=172.21.0.2,172.21.0.3,172.21.0.4;user-agent=grpc-go/1.26.0)
```

```
* 172.21.0.4: rpc error: code = Unknown desc = Unauthorized
```

These errors are not related to the API handling, but most probably
coming from one of the actions performed by the handler. Wrap the errors
to get better debugging output.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-16 13:17:40 -07:00
Andrey Smirnov
2f4fb34baf fix: update container name in docker crashdump
Small bug resulted in container names being cut in the wrong way.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-16 12:49:29 -07:00
Andrey Smirnov
1a0e1bc393 chore: update module dependencies
Fixes #2316

Simply update dependencies we don't track on version level to be
compatible with Talos components (like etcd or k8s).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-16 12:00:50 -07:00
Andrey Smirnov
41d5f7859a chore: update golangci-lint to 1.28.3
Fixes #2272

`gofumpt` is now included in `golangci-lint`, but `gofumports` is not,
so we keep using it as a separate binary, with its version kept in
sync with `golangci-lint`.

This contains fixes from:

* `gofumpt` (automated, mostly around octal constants)
* `exhaustive` in `switch` statements
* `noctx` (adding context with default timeout to http requests)

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-16 08:05:42 -07:00
Andrey Smirnov
e82895ccc5 chore: upgrade Go to 1.14.5
go1.14.5 (released 2020/07/14) includes security fixes to the
crypto/x509 and net/http packages.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-16 07:05:54 -07:00
Spencer Smith
f290f88160 chore: update clusterctl for CI testing
This PR brings in the latest version of clusterctl that has built-in
support for the talos repos. I'll be chasing this with a move to using
the control-plane provider as well!

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-07-15 19:33:59 -04:00
Andrey Smirnov
c54639e541 feat: implement server-side API for cluster health checks
This implements existing server-side health checks as defined in
`internal/pkg/cluster/checks` in Talos API.

Summary of changes:

* new `cluster` API

* `apid` now listens without auth on local file socket

* `cluster` API is for now implemented in `machined`, but we can move it
to the new service if we find it more appropriate

* `talosctl health` by default now does server-side health check

UX: `talosctl health` without arguments does health check for the
cluster if it has healthy K8s to return master/worker nodes. If needed,
node list can be overridden with flags.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-15 13:52:13 -07:00
Spencer Smith
7d10677ee8 docs: update worker creation flags for azure docs
This PR updates the worker flags for azure. Fixes an issue where, if you
have multiple subnets and the talos one isn't default, the workers and
control plane nodes came up on different subnets. Requires updating the
firewalls if they don't come up in the same subnet, so this is better
UX.

Also added a note that azure support is broken in v0.5.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-07-15 12:03:33 -07:00
Andrew Rynhard
0617a10027 feat: upgrade Kubernetes to v1.19.0-rc.0
This brings in the latest version of Kubernetes.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-07-14 13:07:18 -07:00
Andrew Rynhard
fd62d457c3 release(v0.6.0-alpha.5): prepare release
This is the official v0.6.0-alpha.5 release.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-07-13 13:51:43 -07:00
Andrew Rynhard
d2fc210684 chore: update meeting links
This updates the office hours and contributors meetings links.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-07-13 12:50:17 -07:00
Andrey Smirnov
cbb7ca8390 refactor: merge osd into machined
This merges `osd` API into `machined`. API was copied from `osd` into
`machined`, and `osd` API was deprecated.

For backwards compatibility, `machined` still implements `osd` API, so
older Talos API clients can still talk to the node without changes.

Docs were updated. No functional changes.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-13 12:50:00 -07:00
Andrey Smirnov
19343e5f1a test: workaround famous flaky Containerd.RunTwice test
I'm not sure if this is going to stop it, but feels like we're hitting
some race condition in containerd itself, so attempt to sleep a bit in
between the container launches to avoid the errors.

```
=== RUN   TestContainerdSuite/TestRunTwice
2020/07/10 18:57:24 state Running: Started task 036de7a9-f667-48e8-905f-216233980f94 (PID 4964) for container 036de7a9-f667-48e8-905f-216233980f94
    TestContainerdSuite/TestRunTwice: containerd_test.go:179:
        	Error Trace:	containerd_test.go:179
        	Error:      	Received unexpected error:
        	            	failed to create task: "036de7a9-f667-48e8-905f-216233980f94": dial unix \00/containerd-shim/f88fff9fe9795db4229846b09b2da816f6bd981b8112345486ff3b5653892920.sock: connect: connection refused: unknown
        	Test:       	TestContainerdSuite/TestRunTwice
--- FAIL: TestContainerdSuite (6.62s)
    --- PASS: TestContainerdSuite/TestContainerCleanup (0.92s)
    --- PASS: TestContainerdSuite/TestImportFail (0.54s)
    --- PASS: TestContainerdSuite/TestImportSuccess (0.45s)
    --- PASS: TestContainerdSuite/TestRunLogs (0.26s)
    --- PASS: TestContainerdSuite/TestRunSuccess (0.31s)
    --- FAIL: TestContainerdSuite/TestRunTwice (0.39s)
    --- PASS: TestContainerdSuite/TestStopFailingAndRestarting (1.61s)
    --- PASS: TestContainerdSuite/TestStopSigKill (0.95s)
FAIL
```

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-10 13:33:18 -07:00
Andrey Smirnov
2144b6a099 feat: add names to tasks and phases
For each task, the name follows the function name for now (but it could be
customized if needed).

Phases are named after the main theme of the tasks inside the phase.

Task and phase events are now displayed in `talosctl events`.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-10 13:09:50 -07:00
Artem Chernyshev
8fc352ec4f feat: merge mode in talosctl kubeconfig
New flag `-m` will enable merge mechanism in `talosctl kubeconfig`

Command examples:

```
talosctl kubeconfig -m

talosctl kubeconfig -m ~/.kube/config
```

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-07-10 12:39:30 -07:00
Andrey Smirnov
bdd011be30 test: update events test with more flow control
Some of the event tests rely on publishing more events than buffer
capacity, so consumers should be able to keep up with the producer to
avoid hitting buffer overrun.

The producer was already rate-limited; this adds rate limiting for consumers
with a higher burst, which should still allow consumers to reach the blocked
state but synchronizes them a bit with each other.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-10 12:14:57 -07:00
Andrey Smirnov
9590030a84 feat: print crash dump in talosctl cluster create on failure
When cluster fails to be bootstrapped or it fails the health check, it's
hard to find the root cause without the logs.

This change adds optional crashdump (it dumps firecracker logs or docker
logs) after provisioning failure. It's not enabled by default.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-10 11:54:07 -07:00
Andrey Smirnov
60155c8048 test: update tests for pkg/follow to be less time-dependent
These tests rely a lot on `inotify` and interaction with the kernel. We
try to make them less dependent on wall time and performance by using
size hints and timeouts.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-10 21:00:43 +03:00
Andrey Smirnov
931237b23c test: update init node check in reset API tests
Previously we assumed that node 0 is the init node, and it can't be
reset. With new bootstrap API approach, there's no init node, and all
the nodes can be reset. This corrects the check to skip only the init
node, and with bootstrap API there's no init node (so no nodes are
skipped).

Fixes #2277

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-10 10:48:14 -07:00
Andrey Smirnov
804f162756 fix: improve node uncordon tasks
1. Increase retry timeout.

2. Use timeout per attempt.

3. Check for node readiness as a gate to succeed.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-10 09:26:47 -07:00
Andrey Smirnov
a4a2a3c83a feat: uncordon nodes automatically on boot
Talos will mark node as schedulable if it was previously cordoned by
Talos (for upgrade, reset, etc.)

If user marked node as not schedulable, Talos won't change it on boot.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-09 15:32:36 -07:00
Andrey Smirnov
97d18b1c43 test: fix cli tests after load-balancing got enabled
There were three problems:

* cli tests did commands in sequence assuming they all hit the same
node, but with load-balancing it's no longer true

* restart test was affected, as it hit different node for check after
restart, and it succeeded immediately, while on original node process
was still starting which resulted in failure in the next tests; replace
the check to make sure service is up and healthy, so that test leaves
cluster in a good state

* restart API response had wrong format (no message returned) which
resulted in failures with apid proxy (when used with `-n`)

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-09 14:06:30 -07:00
Andrey Smirnov
50db9b6073 docs: update firecracker for new home of tc-redirect-tap plugin
See https://github.com/firecracker-microvm/firecracker-go-sdk/issues/174#issuecomment-655798205

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-09 11:47:28 -07:00
Andrey Smirnov
4f5660b22b test: fix sonobuoy delete
It expects kubeconfig as required argument.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-09 18:46:57 +03:00
Andrey Smirnov
5ecddf2866 feat: add round-robin LB policy to Talos client by default
Handling of multiple endpoints has already been implemented in #2094.

This PR enables round-robin policy so that grpc picks up new endpoint
for each call (and not send each request to the first control plane
node).

Endpoint list is randomized to handle cases when only one request is
going to be sent, so that it doesn't go always to the first node in the
list.

grpc handles dead/unresponsive nodes automatically for us.

`talosctl cluster create` and provision tests switched to use
client-side load balancer for Talos API.

Additional improvements we got:

* `talosctl` now reports correct node IP when using commands without
`-n`, not the loadbalancer IP (if using multiple endpoints of course)

* loadbalancer can't provide reliable handling of errors when upstream
server is unresponsive or there're no upstreams available, grpc returns
much more helpful errors

Fixes #1641

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-09 08:35:15 -07:00
Andrey Smirnov
4cc074cdba feat: implement API access to event history
1. Add [xid-based](https://github.com/rs/xid) event IDs. Xids
are sortable and unique enough. Xids also encode event publishing
time with a second precision.

2. Add three ways to look back into event history: based on number of
events, on time and ID. Lookup via ID might be used to restart event
polling in case of broken API connection from the same moment.

3. Reimplement core event buffer with positions which are always
incremented instead of generation+index; this implementation is much
simpler (idea from circular buffer).

4. By default, Events API works the same - it shows no history and
starts streaming new events only.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-08 10:54:50 -07:00
Andrey Smirnov
aa687cf8cd fix: update the control plane cluster health check
Include kube-apiserver in the list of daemon sets to be checked, and
for each daemon set verify number of pods running and ready, as when
control plane is damaged daemon set properties are not updated properly.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-08 17:53:21 +03:00
Andrey Smirnov
ddbe9cfc2f fix: update timeouts on service startup to match boot timeout
There's a global timeout for all services to be up: it's 5 minutes. We
need to make sure each service startup takes less than that, otherwise
boot sequence is aborted and there's no way to see the error message for
each particular service.

Also propagate contexts correctly and set some default timeouts to make
sure API operations are not hanging forever.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-08 07:39:36 -07:00
Andrey Smirnov
d210d7f1a3 fix: implement Unload() for services to make sure bootkube runs always
The problem was that the flow to re-run the service with different
parameters was not consistent: it depends on whether the service was loaded
before or not, but that is not reliable, as e.g. with bootstrap API
`bootkube` is loaded for the bootstrap and stays until reboot, and is never
loaded for any other boot.

`Unload()` stops and removes the service completely so that new instance
of the service could be loaded and started.

This fixes the edge case with recovery API not running bootkube properly
before reboot after bootstrap.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-08 07:15:45 -07:00
Andrey Smirnov
ac28b9c976 fix: print correct sequence/task duration
Arguments to `defer` are evaluated at the time the `defer` statement is
executed, not during the `defer` callback.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-07 12:10:31 -07:00
Spencer Smith
67cddaff44 chore: wait for resource deletion in sonobuoy
This PR fixes the fix where we try to cleanup sonobuoy. We did that
successfully, but still got errors b/c we were immediately trying to
create service accounts in a namespace that was being deleted. This
should fix that. The sonobuoy default wait period is 1hr, should be
plenty.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-07-07 10:58:47 -07:00
Spencer Smith
13bd77355e chore: cleanup sonobuoy after failed attempts
This PR will make sure that, if we're going to retry sonobuoy, we run
the delete command first to clean up any dangling resources.

Closes #2266.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-07-06 11:46:49 -07:00
Andrey Smirnov
a6b3bd2ff6 feat: implement service events
This implements service events, adds test for events API based on
service events as they're the easiest to generate on demand.

Disabled validate test for 'metal' as it validates disk device against
local system which doesn't make much sense.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-03 13:52:53 -07:00
Andrey Smirnov
0cd86f17c3 fix: provide default DNS domain to talosctl cluster create
Fixes #2263

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-02 13:42:45 -07:00
Andrew Rynhard
a5a2d959ed feat: upgrade runc to v1.0.0-rc90
This updates runc to the same version vendored by containerd.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-07-02 13:19:33 -07:00
Andrey Smirnov
219425f629 test: resolve old TODO item
I had to copy over some oci stuff from a newer package version, but as we
have been using the newer oci for a long time, we don't need the copy anymore.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-02 11:09:58 -07:00
Andrey Smirnov
b4abab0ed0 test: run integration pipeline nightly
Just a copy of `integration` pipeline with the same trigger as `nightly`
pipeline, so we can have two separate pipelines and two notifications
for better visibility.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-02 08:22:33 -07:00
Andrew Rynhard
6c9ef2ae59 feat: upgrade Linux to v5.7.7
This brings in the latest stable version of Linux.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-07-01 14:57:48 -07:00
Andrey Smirnov
ba12095ac7 test: stabilize race unit-tests (circular, events)
Fixes #2243

These tests rely on some kind of sync between readers and writers, as if
circular buffer is overrun, test no longer runs as expected.

We use time-sensitive rate limiter to limit write speed to make sure
readers can always catch up. Lowering the rate should slow down writers
and make tests more likely to succeed.

For #2243, the failure was from buffer overrun: when overrun is
detected, `Watch` function closes the channel (and test "receives" zero
element).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-01 13:39:49 -07:00
Andrey Smirnov
0ea9a0a5ea test: run e2e-firecracker-short for default pipeline only
As many pipelines inherit steps from `default_steps`, take out
`e2e-firecracker-short` from `default_steps`.

`e2e` pipeline only relies on `e2e-docker`.

`integration` pipeline does full firecracker run with `e2e-firecracker`.

`release` pipeline manually pulls in `e2e-firecracker-short` to be on
the safe side.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-01 13:16:25 -07:00