talos

Author	SHA1	Message	Date
Andrey Smirnov	3934c78c94	fix: log messages properly when sequence/phase/task fails Previously log said `done` for the failed items which seemed not obvious when combined with the error message later on. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-21 06:43:03 -07:00
Andrey Smirnov	74413b1393	fix: ignore sequence lock errors in machined This prevents reboots when some action triggers a sequence while another sequence is still running. Fixes #2209 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-20 14:36:06 -07:00
dependabot[bot]	0aae950518	chore: bump lodash from 4.17.15 to 4.17.19 in /docs/website Bumps [lodash](https://github.com/lodash/lodash) from 4.17.15 to 4.17.19. - [Release notes](https://github.com/lodash/lodash/releases) - [Commits](https://github.com/lodash/lodash/compare/4.17.15...4.17.19) Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-20 11:17:18 -07:00
Artem Chernyshev	c70c08c8ce	chore: extract loadbalancer, network, crashdump and process from firecracker Second part of refactoring to split common logic for VM provisioners from Firecracker provisioner. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-07-20 11:03:03 -07:00
Andrey Smirnov	4cd6e7e200	refactor: use `humanize.Bytes` everywhere This removes dependency on `bytefmt` package. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-20 07:26:33 -07:00
Andrey Smirnov	f047c42ae7	test: provide correct installer kernel args for firecracker Firecracker never executes the bootloader, so kernel args passed to the installer aren't used, but if the same disk image is used to boot Talos, e.g. in `qemu`, it fails, for example, to set up the console properly. This PR simply provides those kernel args to the installer so that they're persisted in the image. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-20 16:52:08 +03:00
Artem Chernyshev	19cd46459b	chore: initial extraction of base vm provisioner Created base provisioner struct for all VM based provisioners. Moved state.go and reflect.go to the common module. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-07-18 15:45:54 -07:00
steverfrancis	8dd81b0693	docs: use latest talosctl download link Update download example to reference latest release. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-07-18 14:45:52 -07:00
Artem Chernyshev	3d25ceb13e	chore: move inmemhttp from firecracker provisioner to internal/pkg/ To be reused in qemu provisioner later on. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-07-18 07:11:50 -07:00
Andrey Smirnov	ad99cb6421	feat: implement talosctl dashboard command This builds a simple CLI UI for Talos cluster monitoring. Some new APIs were added for monitoring based on Prometheus procfs package. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-16 14:24:04 -07:00
Andrey Smirnov	56467af6a5	fix: wrap errors in upgrade API handler We often see in the logs errors like: ``` machined Unknown [/machine.MachineService/Upgrade] 6.984959636s unary Unauthorized (:authority=unix:/run/system/machined/machine.sock;content-type=application/grpc;proxyfrom=172.21.0.2,172.21.0.3,172.21.0.4;user-agent=grpc-go/1.26.0) ``` ``` * 172.21.0.4: rpc error: code = Unknown desc = Unauthorized ``` These errors are not related to the API handling, but most probably coming from one of the actions performed by the handler. Wrap the errors to get better debugging output. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-16 13:17:40 -07:00
Andrey Smirnov	2f4fb34baf	fix: update container name in docker crashdump Small bug resulted in container names being cut in the wrong way. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-16 12:49:29 -07:00
Andrey Smirnov	1a0e1bc393	chore: update module dependencies Fixes #2316 Simply update dependencies we don't track on version level to be compatible with Talos components (like etcd or k8s). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-16 12:00:50 -07:00
Andrey Smirnov	41d5f7859a	chore: update golangci-lint to 1.28.3 Fixes #2272 `gofumpt` is now included in `golangci-lint`, but `gofumports` is not, so we keep using it as a separate binary and keep its version in sync with `golangci-lint`. This contains fixes from: * `gofumpt` (automated, mostly around octal constants) * `exhaustive` in `switch` statements * `noctx` (adding context with default timeout to http requests) Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-16 08:05:42 -07:00
Andrey Smirnov	e82895ccc5	chore: upgrade Go to 1.14.5 go1.14.5 (released 2020/07/14) includes security fixes to the crypto/x509 and net/http packages. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-16 07:05:54 -07:00
Spencer Smith	f290f88160	chore: update clusterctl for CI testing This PR brings in the latest version of clusterctl that has built-in support for the talos repos. I'll be chasing this with a move to using the control-plane provider as well! Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>	2020-07-15 19:33:59 -04:00
Andrey Smirnov	c54639e541	feat: implement server-side API for cluster health checks This implements existing server-side health checks as defined in `internal/pkg/cluster/checks` in Talos API. Summary of changes: * new `cluster` API * `apid` now listens without auth on local file socket * `cluster` API is for now implemented in `machined`, but we can move it to the new service if we find it more appropriate * `talosctl health` by default now does server-side health check UX: `talosctl health` without arguments does health check for the cluster if it has healthy K8s to return master/worker nodes. If needed, node list can be overridden with flags. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-15 13:52:13 -07:00
Spencer Smith	7d10677ee8	docs: update worker creation flags for azure docs This PR updates the worker flags for azure. Fixes an issue where, if you have multiple subnets and the talos one isn't default, the workers and control plane nodes came up on different subnets. Requires updating the firewalls if they don't come up in the same subnet, so this is better UX. Also added a note that azure support is broken in v0.5. Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>	2020-07-15 12:03:33 -07:00
Andrew Rynhard	0617a10027	feat: upgrade Kubernetes to v1.19.0-rc.0 This brings in the latest version of Kubernetes. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-07-14 13:07:18 -07:00
Andrew Rynhard	fd62d457c3	release(v0.6.0-alpha.5): prepare release This is the official v0.6.0-alpha.5 release. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-07-13 13:51:43 -07:00
Andrew Rynhard	d2fc210684	chore: update meeting links This updates the office hours and contributors meetings links. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-07-13 12:50:17 -07:00
Andrey Smirnov	cbb7ca8390	refactor: merge osd into machined This merges `osd` API into `machined`. API was copied from `osd` into `machined`, and `osd` API was deprecated. For backwards compatibility, `machined` still implements `osd` API, so older Talos API clients can still talk to the node without changes. Docs were updated. No functional changes. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-13 12:50:00 -07:00
Andrey Smirnov	19343e5f1a	test: workaround famous flaky Containerd.RunTwice test I'm not sure if this is going to stop it, but feels like we're hitting some race condition in containerd itself, so attempt to sleep a bit in between the container launches to avoid the errors. ``` === RUN TestContainerdSuite/TestRunTwice 2020/07/10 18:57:24 state Running: Started task 036de7a9-f667-48e8-905f-216233980f94 (PID 4964) for container 036de7a9-f667-48e8-905f-216233980f94 TestContainerdSuite/TestRunTwice: containerd_test.go:179: Error Trace: containerd_test.go:179 Error: Received unexpected error: failed to create task: "036de7a9-f667-48e8-905f-216233980f94": dial unix \00/containerd-shim/f88fff9fe9795db4229846b09b2da816f6bd981b8112345486ff3b5653892920.sock: connect: connection refused: unknown Test: TestContainerdSuite/TestRunTwice --- FAIL: TestContainerdSuite (6.62s) --- PASS: TestContainerdSuite/TestContainerCleanup (0.92s) --- PASS: TestContainerdSuite/TestImportFail (0.54s) --- PASS: TestContainerdSuite/TestImportSuccess (0.45s) --- PASS: TestContainerdSuite/TestRunLogs (0.26s) --- PASS: TestContainerdSuite/TestRunSuccess (0.31s) --- FAIL: TestContainerdSuite/TestRunTwice (0.39s) --- PASS: TestContainerdSuite/TestStopFailingAndRestarting (1.61s) --- PASS: TestContainerdSuite/TestStopSigKill (0.95s) FAIL ``` Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-10 13:33:18 -07:00
Andrey Smirnov	2144b6a099	feat: add names to tasks and phases For each task, the name follows the function name for now (but it could be customized if needed). Phases are named after the main theme of the tasks inside the phase. Task and phase events are now displayed in `talosctl events`. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-10 13:09:50 -07:00
Artem Chernyshev	8fc352ec4f	feat: merge mode in talosctl kubeconfig New flag `-m` will enable merge mechanism in `talosctl kubeconfig` Command examples: ``` talosctl kubeconfig -m talosctl kubeconfig -m ~/.kube/config ``` Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-07-10 12:39:30 -07:00
Andrey Smirnov	bdd011be30	test: update events test with more flow control Some of the event tests rely on publishing more events than buffer capacity, so consumers should be able to keep up with the producer to avoid hitting buffer overrun. Producer was rate-limited already, now adding rate limiting for consumer with higher burst, it should allow consumers to reach blocked state, but synchronizes them a bit with each other. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-10 12:14:57 -07:00
Andrey Smirnov	9590030a84	feat: print crash dump in `talosctl cluster create` on failure When cluster fails to be bootstrapped or it fails the health check, it's hard to find the root cause without the logs. This change adds optional crashdump (it dumps firecracker logs or docker logs) after provisioning failure. It's not enabled by default. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-10 11:54:07 -07:00
Andrey Smirnov	60155c8048	test: update tests for `pkg/follow` to be less time-dependent These tests rely a lot on `inotify` and interaction with the kernel. We try to make them less dependent on wall time and performance by using size hints and timeouts. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-10 21:00:43 +03:00
Andrey Smirnov	931237b23c	test: update init node check in reset API tests Previously we assumed that node 0 is the init node, and it can't be reset. With new bootstrap API approach, there's no init node, and all the nodes can be reset. This corrects the check to skip only the init node, and with bootstrap API there's no init node (so no nodes are skipped). Fixes #2277 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-10 10:48:14 -07:00
Andrey Smirnov	804f162756	fix: improve node uncordon tasks 1. Increase retry timeout. 2. Use timeout per attempt. 3. Check for node readiness as a gate to succeed. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-10 09:26:47 -07:00
Andrey Smirnov	a4a2a3c83a	feat: uncordon nodes automatically on boot Talos will mark node as schedulable if it was previously cordoned by Talos (for upgrade, reset, etc.) If user marked node as not schedulable, Talos won't change it on boot. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-09 15:32:36 -07:00
Andrey Smirnov	97d18b1c43	test: fix cli tests after load-balancing got enabled There were three problems: * cli tests did commands in sequence assuming they all hit the same node, but with load-balancing it's no longer true * restart test was affected, as it hit different node for check after restart, and it succeeded immediately, while on original node process was still starting which resulted in failure in the next tests; replace the check to make sure service is up and healthy, so that test leaves cluster in a good state * restart API response had wrong format (no message returned) which resulted in failures with apid proxy (when used with `-n`) Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-09 14:06:30 -07:00
Andrey Smirnov	50db9b6073	docs: update firecracker for new home of tc-redirect-tap plugin See https://github.com/firecracker-microvm/firecracker-go-sdk/issues/174#issuecomment-655798205 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-09 11:47:28 -07:00
Andrey Smirnov	4f5660b22b	test: fix sonobuoy delete It expects kubeconfig as required argument. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-09 18:46:57 +03:00
Andrey Smirnov	5ecddf2866	feat: add round-robin LB policy to Talos client by default Handling of multiple endpoints has already been implemented in #2094. This PR enables round-robin policy so that grpc picks up new endpoint for each call (and not send each request to the first control plane node). Endpoint list is randomized to handle cases when only one request is going to be sent, so that it doesn't go always to the first node in the list. grpc handles dead/unresponsive nodes automatically for us. `talosctl cluster create` and provision tests switched to use client-side load balancer for Talos API. On the additional improvements we got: * `talosctl` now reports correct node IP when using commands without `-n`, not the loadbalancer IP (if using multiple endpoints of course) * loadbalancer can't provide reliable handling of errors when upstream server is unresponsive or there're no upstreams available, grpc returns much more helpful errors Fixes #1641 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-09 08:35:15 -07:00
Andrey Smirnov	4cc074cdba	feat: implement API access to event history 1. Add [xid-based](https://github.com/rs/xid) event IDs. Xids are sortable and unique enough. Xids also encode event publishing time with a second precision. 2. Add three ways to look back into event history: based on number of events, on time and ID. Lookup via ID might be used to restart event polling in case of broken API connection from the same moment. 3. Reimplement core event buffer with positions which are always incremented instead of generation+index, this implementation is much more simple (idea from circular buffer). 4. By default, Events API works the same - it shows no history and starts streaming new events only. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-08 10:54:50 -07:00
Andrey Smirnov	aa687cf8cd	fix: update the control plane cluster health check Include kube-apiserver in the list of daemon sets to be checked, and for each daemon set verify number of pods running and ready, as when control plane is damaged daemon set properties are not updated properly. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-08 17:53:21 +03:00
Andrey Smirnov	ddbe9cfc2f	fix: update timeouts on service startup to match boot timeout There's a global timeout for all services to be up: it's 5 minutes. We need to make sure each service startup takes less than that, otherwise boot sequence is aborted and there's no way to see the error message for each particular service. Also propagate contexts correctly and set some default timeouts to make sure API operations are not hanging forever. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-08 07:39:36 -07:00
Andrey Smirnov	d210d7f1a3	fix: implement Unload() for services to make sure bootkube runs always The problem was that flow to re-run the service with different parameters was not consistent: it depends on whether services was loaded before or not, but that is not reliable, as e.g. with bootstrap API `bootkube` is loaded for the bootstrap and stays until reboot, and never loaded for any other boot. `Unload()` stops and removes the service completely so that new instance of the service could be loaded and started. This fixes the edge case with recovery API not running bootkube properly before reboot after bootstrap. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-08 07:15:45 -07:00
Andrey Smirnov	ac28b9c976	fix: print correct sequence/task duration Arguments to `defer` are evaluated at the time the `defer` statement is executed, not when the deferred call runs. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-07 12:10:31 -07:00
Spencer Smith	67cddaff44	chore: wait for resource deletion in sonobuoy This PR fixes the fix where we try to cleanup sonobuoy. We did that successfully, but still got errors b/c we were immediately trying to create service accounts in a namespace that was being deleted. This should fix that. The sonobuoy default wait period is 1hr, should be plenty. Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>	2020-07-07 10:58:47 -07:00
Spencer Smith	13bd77355e	chore: cleanup sonobuoy after failed attempts This PR will make sure that, if we're going to retry sonobuoy, we run the delete command first to clean up any dangling resources. Closes #2266. Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>	2020-07-06 11:46:49 -07:00
Andrey Smirnov	a6b3bd2ff6	feat: implement service events This implements service events, adds test for events API based on service events as they're the easiest to generate on demand. Disabled validate test for 'metal' as it validates disk device against local system which doesn't make much sense. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-03 13:52:53 -07:00
Andrey Smirnov	0cd86f17c3	fix: provide default DNS domain to talosctl cluster create Fixes #2263 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-02 13:42:45 -07:00
Andrew Rynhard	a5a2d959ed	feat: upgrade runc to v1.0.0-rc90 This updates runc to the same version vendored by containerd. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-07-02 13:19:33 -07:00
Andrey Smirnov	219425f629	test: resolve old TODO item I had to copy over some oci stuff from a newer package version, but as we have been using the newer oci for a long time, we don't need the copy anymore. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-02 11:09:58 -07:00
Andrey Smirnov	b4abab0ed0	test: run integration pipeline nightly Just a copy of `integration` pipeline with the same trigger as `nightly` pipeline, so we can have two separate pipelines and two notifications for better visibility. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-02 08:22:33 -07:00
Andrew Rynhard	6c9ef2ae59	feat: upgrade Linux to v5.7.7 This brings in the latest stable version of Linux. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-07-01 14:57:48 -07:00
Andrey Smirnov	ba12095ac7	test: stabilize race unit-tests (circular, events) Fixes #2243 These tests rely on some kind of sync between readers and writers, as if circular buffer is overrun, test no longer runs as expected. We use time-sensitive rate limiter to limit write speed to make sure readers can always catch up. Lowering the rate should slow down writers and make tests more likely to succeed. For #2243, the failure was from buffer overrun: when overrun is detected, `Watch` function closes the channel (and test "receives" zero element). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-01 13:39:49 -07:00
Andrey Smirnov	0ea9a0a5ea	test: run `e2e-firecracker-short` for default pipeline only As many pipelines inherit steps from `default_steps`, take out `e2e-firecracker-short` from `default_steps`. `e2e` pipeline only relies on `e2e-docker`. `integration` pipeline does full firecracker run with `e2e-firecracker`. `release` pipeline manually pulls in `e2e-firecracker-short` to be on the safe side. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-01 13:16:25 -07:00


1657 Commits