For single-node upgrades, if we drop the CRI state, the API server won't
start after reboot, which breaks the control plane.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Due to the race between `Read()` and context cancellation, an error
might be returned, which we can safely ignore.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
With load-balancing enabled by default, running `talosctl` without
`--nodes` is risky, as the request might hit any control plane node.
Only two commands do not enforce this check, as they set up their own
node contexts: `crashdump` and `health` (client-side).
Integration tests were updated to always supply the `--nodes` CLI
argument. While doing that, I refactored the storage for discovered
nodes to use the existing `cluster.Info` interface.
The downside is that with e2e CAPI tests, CLI tests will be mostly
skipped, as we don't support discovery in CLI tests at the moment. This
can be fixed by using `talosctl kubeconfig` + `kubectl get nodes` for
node discovery.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
The latest version in the 3.3 branch is 3.3.23, but it's broken, so we
use the previous stable version.
Switch to official etcd gcr.io registry, early support for arm64.
Move `etcd` service to run in system containerd.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Previously the log said `done` for failed items, which was confusing
when combined with the error message later on.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This prevents reboots when some action triggers a sequence while another
sequence is still running.
Fixes #2209
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Second part of refactoring to split common logic for VM provisioners
from Firecracker provisioner.
Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
Firecracker never executes the bootloader, so kernel args passed to the
installer aren't used, but if the same disk image is used to boot Talos
e.g. in `qemu`, it fails to set up console properly for example.
This PR simply provides those kernel args to the installer so that
they're persisted in the image.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Created base provisioner struct for all VM based provisioners.
Moved state.go and reflect.go to the common module.
Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
This builds a simple CLI UI for Talos cluster monitoring.
Some new APIs were added for monitoring based on Prometheus procfs
package.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
We often see in the logs errors like:
```
machined Unknown [/machine.MachineService/Upgrade] 6.984959636s unary Unauthorized (:authority=unix:/run/system/machined/machine.sock;content-type=application/grpc;proxyfrom=172.21.0.2,172.21.0.3,172.21.0.4;user-agent=grpc-go/1.26.0)
```
```
* 172.21.0.4: rpc error: code = Unknown desc = Unauthorized
```
These errors are not related to the API handling, but most probably
coming from one of the actions performed by the handler. Wrap the errors
to get better debugging output.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Fixes #2316
Simply update dependencies we don't track on version level to be
compatible with Talos components (like etcd or k8s).
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Fixes #2272
`gofumpt` is now included in `golangci-lint`, but `gofumports` is not,
so we keep using it as a separate binary and keep its version in sync
with `golangci-lint`.
This contains fixes from:
* `gofumpt` (automated, mostly around octal constants)
* `exhaustive` in `switch` statements
* `noctx` (adding context with default timeout to http requests)
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
go1.14.5 (released 2020/07/14) includes security fixes to the
crypto/x509 and net/http packages.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This PR brings in the latest version of clusterctl that has built-in
support for the talos repos. I'll be chasing this with a move to using
the control-plane provider as well!
Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
This implements existing server-side health checks as defined in
`internal/pkg/cluster/checks` in Talos API.
Summary of changes:
* new `cluster` API
* `apid` now listens without auth on local file socket
* `cluster` API is for now implemented in `machined`, but we can move it
to the new service if we find it more appropriate
* `talosctl health` by default now does server-side health check
UX: `talosctl health` without arguments runs the health check for the
cluster, provided Kubernetes is healthy enough to return the
master/worker nodes. If needed, the node list can be overridden with
flags.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
This PR updates the worker flags for azure. Fixes an issue where, if you
have multiple subnets and the talos one isn't default, the workers and
control plane nodes came up on different subnets. Requires updating the
firewalls if they don't come up in the same subnet, so this is better
UX.
Also added a note that azure support is broken in v0.5.
Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
This merges `osd` API into `machined`. API was copied from `osd` into
`machined`, and `osd` API was deprecated.
For backwards compatibility, `machined` still implements `osd` API, so
older Talos API clients can still talk to the node without changes.
Docs were updated. No functional changes.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
I'm not sure if this is going to stop it, but feels like we're hitting
some race condition in containerd itself, so attempt to sleep a bit in
between the container launches to avoid the errors.
```
=== RUN TestContainerdSuite/TestRunTwice
2020/07/10 18:57:24 state Running: Started task 036de7a9-f667-48e8-905f-216233980f94 (PID 4964) for container 036de7a9-f667-48e8-905f-216233980f94
TestContainerdSuite/TestRunTwice: containerd_test.go:179:
Error Trace: containerd_test.go:179
Error: Received unexpected error:
failed to create task: "036de7a9-f667-48e8-905f-216233980f94": dial unix \00/containerd-shim/f88fff9fe9795db4229846b09b2da816f6bd981b8112345486ff3b5653892920.sock: connect: connection refused: unknown
Test: TestContainerdSuite/TestRunTwice
--- FAIL: TestContainerdSuite (6.62s)
--- PASS: TestContainerdSuite/TestContainerCleanup (0.92s)
--- PASS: TestContainerdSuite/TestImportFail (0.54s)
--- PASS: TestContainerdSuite/TestImportSuccess (0.45s)
--- PASS: TestContainerdSuite/TestRunLogs (0.26s)
--- PASS: TestContainerdSuite/TestRunSuccess (0.31s)
--- FAIL: TestContainerdSuite/TestRunTwice (0.39s)
--- PASS: TestContainerdSuite/TestStopFailingAndRestarting (1.61s)
--- PASS: TestContainerdSuite/TestStopSigKill (0.95s)
FAIL
```
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
For each task, the name follows the function name for now (but it could
be customized if needed).
Phases are named after the main theme of the tasks inside the phase.
Task and phase events are now displayed in `talosctl events`.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Some of the event tests rely on publishing more events than buffer
capacity, so consumers should be able to keep up with the producer to
avoid hitting buffer overrun.
The producer was already rate-limited; this adds rate limiting for
consumers with a higher burst. It should still allow consumers to reach
the blocked state, but synchronizes them a bit with each other.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
When cluster fails to be bootstrapped or it fails the health check, it's
hard to find the root cause without the logs.
This change adds optional crashdump (it dumps firecracker logs or docker
logs) after provisioning failure. It's not enabled by default.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
These tests rely a lot on `inotify` and interaction with the kernel. We
try to make them less dependent on wall time and performance by using
size hints and timeouts.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Previously we assumed that node 0 is the init node, and it can't be
reset. With new bootstrap API approach, there's no init node, and all
the nodes can be reset. This corrects the check to skip only the init
node, and with bootstrap API there's no init node (so no nodes are
skipped).
Fixes #2277
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
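The corrected check from the commit above can be sketched in Go as a filter that skips only nodes of the init type. The `node` struct and the string type names are hypothetical stand-ins for whatever the test suite uses.

```go
package main

import "fmt"

// node is a hypothetical, simplified view of a cluster node.
type node struct {
	Name string
	Type string // assumed values: "init", "controlplane", "join"
}

// resettable returns the nodes that may be reset: every node except an init
// node. With the bootstrap API there is no init node, so nothing is skipped.
func resettable(nodes []node) []node {
	var out []node

	for _, n := range nodes {
		if n.Type == "init" {
			continue
		}

		out = append(out, n)
	}

	return out
}

func main() {
	bootstrapCluster := []node{
		{Name: "master-1", Type: "controlplane"},
		{Name: "master-2", Type: "controlplane"},
		{Name: "worker-1", Type: "join"},
	}

	fmt.Println("nodes eligible for reset:", len(resettable(bootstrapCluster)))
}
```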
1. Increase retry timeout.
2. Use timeout per attempt.
3. Check for node readiness as a gate to succeed.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Talos will mark node as schedulable if it was previously cordoned by
Talos (for upgrade, reset, etc.)
If user marked node as not schedulable, Talos won't change it on boot.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
There were three problems:
* cli tests did commands in sequence assuming they all hit the same
node, but with load-balancing it's no longer true
* restart test was affected, as it hit different node for check after
restart, and it succeeded immediately, while on original node process
was still starting which resulted in failure in the next tests; replace
the check to make sure service is up and healthy, so that test leaves
cluster in a good state
* restart API response had wrong format (no message returned) which
resulted in failures with apid proxy (when used with `-n`)
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Handling of multiple endpoints has already been implemented in #2094.
This PR enables round-robin policy so that grpc picks up new endpoint
for each call (and not send each request to the first control plane
node).
Endpoint list is randomized to handle cases when only one request is
going to be sent, so that it doesn't go always to the first node in the
list.
gRPC handles dead/unresponsive nodes automatically for us.
`talosctl cluster create` and provision tests switched to use
client-side load balancer for Talos API.
Additional improvements:
* `talosctl` now reports correct node IP when using commands without
`-n`, not the loadbalancer IP (if using multiple endpoints of course)
* loadbalancer can't provide reliable handling of errors when upstream
server is unresponsive or there're no upstreams available, grpc returns
much more helpful errors
Fixes #1641
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
1. Add [xid-based](https://github.com/rs/xid) event IDs. Xids
are sortable and unique enough. Xids also encode event publishing
time with a second precision.
2. Add three ways to look back into event history: based on number of
events, on time and ID. Lookup via ID might be used to restart event
polling in case of broken API connection from the same moment.
3. Reimplement the core event buffer with positions which are always
incremented instead of generation+index; this implementation is much
simpler (idea from circular buffer).
4. By default, Events API works the same - it shows no history and
starts streaming new events only.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Include kube-apiserver in the list of daemon sets to be checked, and
for each daemon set verify the number of pods running and ready, because
when the control plane is damaged, daemon set properties are not updated
properly.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
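A rough Go sketch of the check described above: instead of trusting the daemon set's own status, count the pods that are actually running and ready and compare against the expected number. The `podStatus` struct and `checkDaemonSetPods` function are hypothetical, and the real check would read pods from the Kubernetes API rather than from a slice.

```go
package main

import "fmt"

// podStatus is a hypothetical, simplified view of a pod.
type podStatus struct {
	Running bool
	Ready   bool
}

// checkDaemonSetPods verifies that exactly `desired` pods of the named
// daemon set are both running and ready.
func checkDaemonSetPods(name string, desired int, pods []podStatus) error {
	running, ready := 0, 0

	for _, p := range pods {
		if p.Running {
			running++
		}

		if p.Ready {
			ready++
		}
	}

	if running != desired || ready != desired {
		return fmt.Errorf("daemon set %q: expected %d pods, got %d running and %d ready",
			name, desired, running, ready)
	}

	return nil
}

func main() {
	pods := []podStatus{
		{Running: true, Ready: true},
		{Running: true, Ready: false}, // started, but not passing readiness
		{Running: true, Ready: true},
	}

	fmt.Println(checkDaemonSetPods("kube-apiserver", 3, pods))
}
```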