talos

Author	SHA1	Message	Date
Andrew Rynhard	568e398578	release(v0.6.0-alpha.6): prepare release This is the official v0.6.0-alpha.6 release. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-07-27 15:19:50 -07:00
Andrew Rynhard	fbf3fd5304	chore: set default CIDRs This reverts some temporary changes we made to get around some CI issues. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-07-27 14:24:08 -07:00
Andrew Rynhard	1fdbcba763	fix: log interface on validation error Print the failing interface name if addressing check errors. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-07-27 14:22:18 -07:00
Andrey Smirnov	564111d9d5	chore: use outer docker as buildkit instance This should provide caching for the builds. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-27 13:55:29 -07:00
Andrey Smirnov	ca1972e708	fix: skip removing CRI state when doing upgrade with preserve For single-node upgrades if we drop the CRI state, API server won't start after reboot and this breaks the control plane. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-27 13:19:35 -07:00
Andrey Smirnov	3b42f56f43	fix: skip vmware platform for !amd64 This fixes the build on arm64. The fix itself is part of PR #2156. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-27 12:54:04 -07:00
Andrey Smirnov	b5b70ec858	chore: upgrade pkgs and tools for Go 1.14.6 This also brings in multi-arch pkgs and tools, but we're not consuming arm64 images yet. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-27 12:33:53 -07:00
Andrew Rynhard	1f31d24e55	chore: use Kubernetes pipelines This moves to using Kubernetes pipelines. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-07-27 12:09:53 -07:00
Andrey Smirnov	c85608b8d9	test: add an option to bind docker to specific host IP This allows to override default `0.0.0.0` (`*`) to a specific IP to avoid conflicts. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-27 21:13:28 +03:00
Andrey Smirnov	13c0052a6c	test: fix racy test ReaderNoFollow Due to the race between `Read()` and context cancellation, error might be returned which we can safely ignore. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-27 21:13:11 +03:00
Andrey Smirnov	3d8418a689	feat: force nodes to be set in `talosctl` commands using the API With load-balancing enabled by default running `talosctl` without `--nodes` is risky, as it might hit any control plane by default without `--nodes`. Only two commands do not enforce this check, as they do their own node contexts: `crashdump` and `health` (client-side). Integration tests were updated to always supply `--nodes` cli argument, while doing that I refactored the storage for discovered nodes to use existing `cluster.Info` interface. The downside is that with e2e CAPI tests CLI tests will be mostly skipped as we don't support discovery in CLI tests at the moment. This can be fixed by using `talosctl kubeconfig` + `kubectl get nodes` for node discovery. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-21 12:17:43 -07:00
Andrey Smirnov	f23c9111d1	feat: upgrade etcd to 3.3.22 version Latest version in 3.3 branch is 3.3.23, but it's broken, so we use previous stable version. Switch to official etcd gcr.io registry, early support for arm64. Move `etcd` service to run in system containerd. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-21 09:44:43 -07:00
Andrey Smirnov	70a65cbb01	feat: make partitions on additional disk without size occupy full disk Fixes #2214 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-21 07:33:07 -07:00
Andrey Smirnov	3934c78c94	fix: log messages properly when sequence/phase/task fails Previously log said `done` for the failed items which seemed not obvious when combined with the error message later on. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-21 06:43:03 -07:00
Andrey Smirnov	74413b1393	fix: ignore sequence lock errors in machined This prevents reboots when some action triggers a sequence while another sequence is still running. Fixes #2209 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-20 14:36:06 -07:00
dependabot[bot]	0aae950518	chore: bump lodash from 4.17.15 to 4.17.19 in /docs/website Bumps [lodash](https://github.com/lodash/lodash) from 4.17.15 to 4.17.19. - [Release notes](https://github.com/lodash/lodash/releases) - [Commits](https://github.com/lodash/lodash/compare/4.17.15...4.17.19) Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-20 11:17:18 -07:00
Artem Chernyshev	c70c08c8ce	chore: extract loadbalancer, network, crashdump and process from firecracker Second part of refactoring to split common logic for VM provisioners from Firecracker provisioner. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-07-20 11:03:03 -07:00
Andrey Smirnov	4cd6e7e200	refactor: use `humanize.Bytes` everywhere This removes dependency on `bytefmt` package. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-20 07:26:33 -07:00
Andrey Smirnov	f047c42ae7	test: provide correct installer kernel args for firecracker Firecracker never executes the bootloader, so kernel args passed to the installer aren't used, but if the same disk image is used to boot Talos e.g. in `qemu`, it fails to set up console properly for example. This PR simply provides those kernel args to the installer so that they're persisted in the image. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-20 16:52:08 +03:00
Artem Chernyshev	19cd46459b	chore: initial extraction of base vm provisioner Created base provisioner struct for all VM based provisioners. Moved state.go and reflect.go to the common module. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-07-18 15:45:54 -07:00
steverfrancis	8dd81b0693	docs: use latest talosctl download link Update download example to reference latest release. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-07-18 14:45:52 -07:00
Artem Chernyshev	3d25ceb13e	chore: move inmemhttp from firecracker provisioner to internal/pkg/ To be reused in qemu provisioner later on. Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-07-18 07:11:50 -07:00
Andrey Smirnov	ad99cb6421	feat: implement talosctl dashboard command This builds a simple CLI UI for Talos cluster monitoring. Some new APIs were added for monitoring based on Prometheus procfs package. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-16 14:24:04 -07:00
Andrey Smirnov	56467af6a5	fix: wrap errors in upgrade API handler We often see in the logs errors like: ``` machined Unknown [/machine.MachineService/Upgrade] 6.984959636s unary Unauthorized (:authority=unix:/run/system/machined/machine.sock;content-type=application/grpc;proxyfrom=172.21.0.2,172.21.0.3,172.21.0.4;user-agent=grpc-go/1.26.0) ``` ``` * 172.21.0.4: rpc error: code = Unknown desc = Unauthorized ``` These errors are not related to the API handling, but most probably coming from one of the actions performed by the handler. Wrap the errors to get better debugging output. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-16 13:17:40 -07:00
Andrey Smirnov	2f4fb34baf	fix: update container name in docker crashdump Small bug resulted in container names being cut in the wrong way. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-16 12:49:29 -07:00
Andrey Smirnov	1a0e1bc393	chore: update module dependencies Fixes #2316 Simply update dependencies we don't track on version level to be compatible with Talos components (like etcd or k8s). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-16 12:00:50 -07:00
Andrey Smirnov	41d5f7859a	chore: update golangci-lint to 1.28.3 Fixes #2272 `gofumpt` is now included into `golangci-lint`, but not the `gofumports`, so we keep it using it as separate binary, but we keep versions in sync with `golangci-lint`. This contains fixes from: * `gofumpt` (automated, mostly around octal constants) * `exhaustive` in `switch` statements * `noctx` (adding context with default timeout to http requests) Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-16 08:05:42 -07:00
Andrey Smirnov	e82895ccc5	chore: upgrade Go to 1.14.5 go1.14.5 (released 2020/07/14) includes security fixes to the crypto/x509 and net/http packages. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-16 07:05:54 -07:00
Spencer Smith	f290f88160	chore: update clusterctl for CI testing This PR brings in the latest version of clusterctl that has built-in support for the talos repos. I'll be chasing this with a move to using the control-plane provider as well! Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>	2020-07-15 19:33:59 -04:00
Andrey Smirnov	c54639e541	feat: implement server-side API for cluster health checks This implements existing server-side health checks as defined in `internal/pkg/cluster/checks` in Talos API. Summary of changes: * new `cluster` API * `apid` now listens without auth on local file socket * `cluster` API is for now implemented in `machined`, but we can move it to the new service if we find it more appropriate * `talosctl health` by default now does server-side health check UX: `talosctl health` without arguments does health check for the cluster if it has healthy K8s to return master/worker nodes. If needed, node list can be overridden with flags. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-15 13:52:13 -07:00
Spencer Smith	7d10677ee8	docs: update worker creation flags for azure docs This PR updates the worker flags for azure. Fixes an issue where, if you have multiple subnets and the talos one isn't default, the workers and control plane nodes came up on different subnets. Requires updating the firewalls if they don't come up in the same subnet, so this is better UX. Also added a note that azure support is broken in v0.5. Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>	2020-07-15 12:03:33 -07:00
Andrew Rynhard	0617a10027	feat: upgrade Kubernetes to v1.19.0-rc.0 This brings in the latest version of Kubernetes. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-07-14 13:07:18 -07:00
Andrew Rynhard	fd62d457c3	release(v0.6.0-alpha.5): prepare release This is the official v0.6.0-alpha.5 release. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-07-13 13:51:43 -07:00
Andrew Rynhard	d2fc210684	chore: update meeting links This updates the office hours and contributors meetings links. Signed-off-by: Andrew Rynhard <andrew@rynhard.io>	2020-07-13 12:50:17 -07:00
Andrey Smirnov	cbb7ca8390	refactor: merge osd into machined This merges `osd` API into `machined`. API was copied from `osd` into `machined`, and `osd` API was deprecated. For backwards compatibility, `machined` still implements `osd` API, so older Talos API clients can still talk to the node without changes. Docs were updated. No functional changes. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-13 12:50:00 -07:00
Andrey Smirnov	19343e5f1a	test: workaround famous flaky Containerd.RunTwice test I'm not sure if this is going to stop it, but feels like we're hitting some race condition in containerd itself, so attempt to sleep a bit in between the container launches to avoid the errors. ``` === RUN TestContainerdSuite/TestRunTwice 2020/07/10 18:57:24 state Running: Started task 036de7a9-f667-48e8-905f-216233980f94 (PID 4964) for container 036de7a9-f667-48e8-905f-216233980f94 TestContainerdSuite/TestRunTwice: containerd_test.go:179: Error Trace: containerd_test.go:179 Error: Received unexpected error: failed to create task: "036de7a9-f667-48e8-905f-216233980f94": dial unix \00/containerd-shim/f88fff9fe9795db4229846b09b2da816f6bd981b8112345486ff3b5653892920.sock: connect: connection refused: unknown Test: TestContainerdSuite/TestRunTwice --- FAIL: TestContainerdSuite (6.62s) --- PASS: TestContainerdSuite/TestContainerCleanup (0.92s) --- PASS: TestContainerdSuite/TestImportFail (0.54s) --- PASS: TestContainerdSuite/TestImportSuccess (0.45s) --- PASS: TestContainerdSuite/TestRunLogs (0.26s) --- PASS: TestContainerdSuite/TestRunSuccess (0.31s) --- FAIL: TestContainerdSuite/TestRunTwice (0.39s) --- PASS: TestContainerdSuite/TestStopFailingAndRestarting (1.61s) --- PASS: TestContainerdSuite/TestStopSigKill (0.95s) FAIL ``` Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-10 13:33:18 -07:00
Andrey Smirnov	2144b6a099	feat: add names to tasks and phases For each task, name follow function name for now (but it could be customized if needed). Phases are after main theme of the tasks inside the phase. Task and phase events are now displayed in `talosctl events`. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-10 13:09:50 -07:00
Artem Chernyshev	8fc352ec4f	feat: merge mode in talosctl kubeconfig New flag `-m` will enable merge mechanism in `talosctl kubeconfig` Command examples: ``` talosctl kubeconfig -m talosctl kubeconfig -m ~/.kube/config ``` Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>	2020-07-10 12:39:30 -07:00
Andrey Smirnov	bdd011be30	test: update events test with more flow control Some of the event tests rely on publishing more events than buffer capacity, so consumers should be able to keep up with the producer to avoid hitting buffer overrun. Producer was rate-limited already, now adding rate limiting for consumer with higher burst, it should allow consumers to reach blocked state, but synchronizes them a bit with each other. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-10 12:14:57 -07:00
Andrey Smirnov	9590030a84	feat: print crash dump in `talosctl cluster create` on failure When cluster fails to be bootstrapped or it fails the health check, it's hard to find the root cause without the logs. This change adds optional crashdump (it dumps firecracker logs or docker logs) after provisioning failure. It's not enabled by default. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-10 11:54:07 -07:00
Andrey Smirnov	60155c8048	test: update tests for `pkg/follow` to be less time-dependent These tests rely a lot on `inotify` and interaction with the kernel. We try to make them less dependent on wall time and performance by using size hints and timeouts. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-10 21:00:43 +03:00
Andrey Smirnov	931237b23c	test: update init node check in reset API tests Previously we assumed that node 0 is the init node, and it can't be reset. With new bootstrap API approach, there's no init node, and all the nodes can be reset. This corrects the check to skip only the init node, and with bootstrap API there's no init node (so no nodes are skipped). Fixes #2277 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-10 10:48:14 -07:00
Andrey Smirnov	804f162756	fix: improve node uncordon tasks 1. Increase retry timeout. 2. Use timeout per attempt. 3. Check for node readiness as a gate to succeed. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-10 09:26:47 -07:00
Andrey Smirnov	a4a2a3c83a	feat: uncordon nodes automatically on boot Talos will mark node as schedulable if it was previously cordoned by Talos (for upgrade, reset, etc.) If user marked node as not schedulable, Talos won't change it on boot. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-09 15:32:36 -07:00
Andrey Smirnov	97d18b1c43	test: fix cli tests after load-balancing got enabled There were three problems: * cli tests did commands in sequence assuming they all hit the same node, but with load-balancing it's no longer true * restart test was affected, as it hit different node for check after restart, and it succeeded immediately, while on original node process was still starting which resulted in failure in the next tests; replace the check to make sure service is up and healthy, so that test leaves cluster in a good state * restart API response had wrong format (no message returned) which resulted in failures with apid proxy (when used with `-n`) Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-09 14:06:30 -07:00
Andrey Smirnov	50db9b6073	docs: update firecracker for new home of tc-redirect-tap plugin See https://github.com/firecracker-microvm/firecracker-go-sdk/issues/174#issuecomment-655798205 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-09 11:47:28 -07:00
Andrey Smirnov	4f5660b22b	test: fix sonobuoy delete It expects kubeconfig as required argument. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-09 18:46:57 +03:00
Andrey Smirnov	5ecddf2866	feat: add round-robin LB policy to Talos client by default Handling of multiple endpoints has already been implemented in #2094. This PR enables round-robin policy so that grpc picks up new endpoint for each call (and not send each request to the first control plane node). Endpoint list is randomized to handle cases when only one request is going to be sent, so that it doesn't go always to the first node in the list. grpc handles dead/unresponsive nodes automatically for us. `talosctl cluster create` and provision tests switched to use client-side load balancer for Talos API. On the additional improvements we got: * `talosctl` now reports correct node IP when using commands without `-n`, not the loadbalancer IP (if using multiple endpoints of course) * loadbalancer can't provide reliable handling of errors when upstream server is unresponsive or there're no upstreams available, grpc returns much more helpful errors Fixes #1641 Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-09 08:35:15 -07:00
Andrey Smirnov	4cc074cdba	feat: implement API access to event history 1. Add [xid-based](https://github.com/rs/xid) event IDs. Xids are sortable and unique enough. Xids also encode event publishing time with a second precision. 2. Add three ways to look back into event history: based on number of events, on time and ID. Lookup via ID might be used to restart event polling in case of broken API connection from the same moment. 3. Reimplement core event buffer with positions which are always incremented instead of generation+index, this implementation is much more simple (idea from circular buffer). 4. By default, Events API works the same - it shows no history and starts streaming new events only. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-08 10:54:50 -07:00
Andrey Smirnov	aa687cf8cd	fix: update the control plane cluster health check Include kube-apiserver in the list of daemon sets to be checked, and for each daemon set verify number of pods running and ready, as when control plane is damaged daemon set properties are not updated properly. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2020-07-08 17:53:21 +03:00

1670 Commits