docs: add troubleshooting control plane documentation

Describe common failures and debugging approach.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Co-authored-by: Spencer Smith <rsmitty@users.noreply.github.com>

@@ -0,0 +1,442 @@
---
title: "Troubleshooting Control Plane"
description: "Troubleshoot control plane failures for running cluster and bootstrap process."
---
<!-- markdownlint-disable MD026 -->
This guide is written as a series of topics with detailed answers for each topic.
It starts with the basics of the control plane and goes into Talos specifics.
This document applies mostly to the Talos 0.9 control plane based on static pods.
If Talos was upgraded from version 0.8, it might still be running the self-hosted control plane; the current status can
be checked with the command `talosctl get bootstrapstatus`:
```bash
$ talosctl -n <IP> get bs
NODE NAMESPACE TYPE ID VERSION SELF HOSTED
172.20.0.2 runtime BootstrapStatus control-plane 1 false
```
In this guide we assume that the Talos client configuration and Talos API access are available.
The Kubernetes client configuration can be pulled from the control plane nodes with `talosctl -n <IP> kubeconfig`
(this command works before Kubernetes is fully booted).
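For example, a quick sketch of pulling the kubeconfig and using it (assuming the Talos 0.9 behavior of writing a `kubeconfig` file into the given directory; `<IP>` is any control plane node):
```bash
# fetch the Kubernetes client configuration from a control plane node
# into the current directory, then use it with kubectl
$ talosctl -n <IP> kubeconfig .
$ kubectl --kubeconfig ./kubeconfig get nodes
```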
### What is a control plane node?
Talos nodes with a `.machine.type` of `init` or `controlplane` are control plane nodes.
The only difference between `init` and `controlplane` nodes is that an `init` node automatically
bootstraps a single-node `etcd` cluster on the first boot if the `etcd` data directory is empty.
A node of type `init` can be replaced with a `controlplane` node which is triggered to run the `etcd` bootstrap
with the `talosctl --nodes <IP> bootstrap` command.
Use of `init` type nodes is discouraged, as it might lead to a split-brain scenario if a node in an
existing cluster is reinstalled while its config type is still `init`.
It is critical to make sure only one control plane node runs in bootstrap mode (either with node type `init` or
via the bootstrap API/`talosctl bootstrap`), as having more than one node in bootstrap mode leads to a split-brain
scenario (multiple `etcd` clusters are built instead of a single cluster).
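For example, with all control plane nodes configured as type `controlplane`, `etcd` is bootstrapped by issuing the command once against a single node:
```bash
# run exactly once, against exactly one control plane node
$ talosctl --nodes <IP> bootstrap
```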
### What is special about a control plane node?
Control plane nodes in Talos run `etcd`, which provides the data store for Kubernetes, and the Kubernetes control plane
components (`kube-apiserver`, `kube-controller-manager` and `kube-scheduler`).
Control plane nodes are tainted by default to prevent workloads from being scheduled onto them.
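The taint can be inspected with `kubectl`; a quick check (the exact taint key depends on the Kubernetes version, typically `node-role.kubernetes.io/master` for this release):
```bash
# show the taints applied to a control plane node
$ kubectl describe node <control-plane-node> | grep Taints
```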
### How many control plane nodes should be deployed?
With a single control plane node, the cluster is not HA: if that single node experiences hardware failure, the cluster
control plane is broken and can't be recovered.
Single control plane node clusters are still used as test clusters and in edge deployments, but it should be noted that this setup is not HA.
The number of control plane nodes should be odd (1, 3, 5, ...), as an even number of nodes doesn't improve failure tolerance:
`etcd` quorum is `floor(n/2) + 1`, so with 2 control plane nodes the quorum is 2, failure of either node breaks quorum, and the
setup is effectively equivalent to a single control plane node cluster.
With three control plane nodes the cluster can tolerate the failure of any single control plane node.
With five control plane nodes the cluster can tolerate the failure of any two control plane nodes.
### What is the control plane endpoint?
Kubernetes requires a control plane endpoint which points to any healthy API server running on a control plane node.
The control plane endpoint is specified as a URL like `https://endpoint:6443/`.
At any point in time, even during failures, the control plane endpoint should point to a healthy API server instance.
As `kube-apiserver` runs with host networking, the control plane endpoint should point to one of the control plane node IPs: `node1:6443`, `node2:6443`, ...
For single control plane node clusters, the control plane endpoint might be `https://IP:6443/` or `https://DNS:6443/`, where `IP` is the IP of the control plane node and `DNS` points to `IP`.
The DNS form of the endpoint allows the IP address of the control plane node to change over time.
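In the machine configuration the endpoint is set under `cluster.controlPlane.endpoint`; a minimal sketch (the host name is just an example):
```yaml
cluster:
  controlPlane:
    endpoint: https://kube.example.com:6443
```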
For HA clusters, the control plane endpoint can be implemented as:
* TCP L7 loadbalancer with active health checks against port 6443
* round-robin DNS with active health checks against port 6443
* BGP anycast IP with health checks
* virtual shared L2 IP
<!-- TODO link to the guide -->
It is critical that the control plane endpoint works correctly during the cluster bootstrap phase, as nodes discover
each other using the control plane endpoint.
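A quick way to verify that the endpoint answers is to query the API server version through it (a sketch; `-k` skips certificate verification, and the `/version` path may require authentication depending on the cluster's anonymous-auth settings):
```bash
# check that the control plane endpoint is reachable and terminated by a kube-apiserver
$ curl -k https://<endpoint>:6443/version
```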
### kubelet is not running on control plane node
The `kubelet` service should be running on a control plane node as soon as networking is configured:
```bash
$ talosctl -n <IP> service kubelet
NODE 172.20.0.2
ID kubelet
STATE Running
HEALTH OK
EVENTS [Running]: Health check successful (2m54s ago)
[Running]: Health check failed: Get "http://127.0.0.1:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused (3m4s ago)
[Running]: Started task kubelet (PID 2334) for container kubelet (3m6s ago)
[Preparing]: Creating service runner (3m6s ago)
[Preparing]: Running pre state (3m15s ago)
[Waiting]: Waiting for service "timed" to be "up" (3m15s ago)
[Waiting]: Waiting for service "cri" to be "up", service "timed" to be "up" (3m16s ago)
[Waiting]: Waiting for service "cri" to be "up", service "networkd" to be "up", service "timed" to be "up" (3m18s ago)
```
If `kubelet` is not running, it might be caused by incorrect configuration; check the `kubelet` logs
with `talosctl logs`:
```bash
$ talosctl -n <IP> logs kubelet
172.20.0.2: I0305 20:45:07.756948 2334 controller.go:101] kubelet config controller: starting controller
172.20.0.2: I0305 20:45:07.756995 2334 controller.go:267] kubelet config controller: ensuring filesystem is set up correctly
172.20.0.2: I0305 20:45:07.757000 2334 fsstore.go:59] kubelet config controller: initializing config checkpoints directory "/etc/kubernetes/kubelet/store"
```
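If the root cause was a machine configuration mistake, one possible sequence (a sketch; `controlplane.yaml` is whatever file holds the corrected configuration) is to re-apply the configuration and restart the service:
```bash
# re-apply the fixed machine configuration and restart kubelet
$ talosctl -n <IP> apply-config -f controlplane.yaml
$ talosctl -n <IP> service kubelet restart
```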
### etcd is not running on bootstrap node
`etcd` should be running on the bootstrap node immediately (the bootstrap node is either the `init` node or the `controlplane` node
against which the `talosctl bootstrap` command was issued).
When the node boots for the first time, the `etcd` data directory `/var/lib/etcd` is empty, and Talos launches `etcd` in a mode which builds the initial single-node cluster.
At this point the `/var/lib/etcd` directory becomes non-empty and `etcd` runs as usual.
If `etcd` is not running, check the `etcd` service state:
```bash
$ talosctl -n <IP> service etcd
NODE 172.20.0.2
ID etcd
STATE Running
HEALTH OK
EVENTS [Running]: Health check successful (3m21s ago)
[Running]: Started task etcd (PID 2343) for container etcd (3m26s ago)
[Preparing]: Creating service runner (3m26s ago)
[Preparing]: Running pre state (3m26s ago)
[Waiting]: Waiting for service "cri" to be "up", service "networkd" to be "up", service "timed" to be "up" (3m26s ago)
```
If the service is stuck in the `Preparing` state on the bootstrap node, it might be related to a slow network: at this stage
Talos pulls the `etcd` image from the container registry.
If the `etcd` service is crashing and restarting, check the service logs with `talosctl -n <IP> logs etcd`.
The most common reasons for crashes are:
* wrong arguments passed via `extraArgs` in the configuration;
* booting Talos on a non-empty disk with a previous Talos installation, so that `/var/lib/etcd` contains data from the old cluster (see the note below).
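In the second case, the old state has to be removed before the node can join a new cluster; one destructive option (a sketch, only if the data from the old cluster is not needed) is to wipe the node:
```bash
# wipe ephemeral state (including /var/lib/etcd) and reboot the node
# WARNING: destroys any data from the previous cluster
$ talosctl -n <IP> reset --graceful=false --reboot
```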
### etcd is not running on non-bootstrap control plane node
The `etcd` service on a non-bootstrap control plane node waits for Kubernetes to boot successfully on the bootstrap node in order to find
other peers to build a cluster.
As soon as the bootstrap node boots the Kubernetes control plane components and `kubectl get endpoints` returns the IP of the bootstrap control plane node, other control plane nodes start joining the cluster, followed by the Kubernetes control plane components on each control plane node.
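This check can be reproduced manually; once the bootstrap node's API server is up, the `kubernetes` endpoints object should list its IP:
```bash
# should list the IP of the bootstrap control plane node once kube-apiserver is up
$ kubectl get endpoints kubernetes -n default
```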
### Kubernetes static pod definitions are not generated
Talos should write static pod definitions for the Kubernetes control plane:
```bash
$ talosctl -n <IP> ls /etc/kubernetes/manifests
NODE NAME
172.20.0.2 .
172.20.0.2 talos-kube-apiserver.yaml
172.20.0.2 talos-kube-controller-manager.yaml
172.20.0.2 talos-kube-scheduler.yaml
```
If the static pod definitions are not rendered, check `etcd` and `kubelet` service health (see above)
and the controller runtime logs (`talosctl logs controller-runtime`).
### Talos prints error `an error on the server ("") has prevented the request from succeeding`
This is expected during initial cluster bootstrap and sometimes after a reboot:
```bash
[ 70.093289] [talos] task labelNodeAsMaster (1/1): starting
[ 80.094038] [talos] retrying error: an error on the server ("") has prevented the request from succeeding (get nodes talos-default-master-1)
```
Initially the `kube-apiserver` component is not running yet, and it takes some time before it becomes fully up
during bootstrap (the image has to be pulled from the Internet, etc.).
Once the control plane endpoint is up, Talos should proceed.
If Talos doesn't proceed further, it might be a configuration issue.
In any case, the status of the control plane components can be checked with `talosctl containers -k`:
```bash
$ talosctl -n <IP> containers --kubernetes
NODE NAMESPACE ID IMAGE PID STATUS
172.20.0.2 k8s.io kube-system/kube-apiserver-talos-default-master-1 k8s.gcr.io/pause:3.2 2539 SANDBOX_READY
172.20.0.2 k8s.io └─ kube-system/kube-apiserver-talos-default-master-1:kube-apiserver k8s.gcr.io/kube-apiserver:v1.20.4 2572 CONTAINER_RUNNING
```
If `kube-apiserver` shows as `CONTAINER_EXITED`, it might have exited due to a configuration error.
Logs can be checked with `talosctl logs --kubernetes` (or with `-k` as a shorthand):
```bash
$ talosctl -n <IP> logs -k kube-system/kube-apiserver-talos-default-master-1:kube-apiserver
172.20.0.2: 2021-03-05T20:46:13.133902064Z stderr F 2021/03/05 20:46:13 Running command:
172.20.0.2: 2021-03-05T20:46:13.133933824Z stderr F Command env: (log-file=, also-stdout=false, redirect-stderr=true)
172.20.0.2: 2021-03-05T20:46:13.133938524Z stderr F Run from directory:
172.20.0.2: 2021-03-05T20:46:13.13394154Z stderr F Executable path: /usr/local/bin/kube-apiserver
...
```
### Talos prints error `nodes "talos-default-master-1" not found`
This error means that `kube-apiserver` is up and the control plane endpoint is healthy, but the `kubelet` hasn't received
its client certificate yet and so wasn't able to register itself.
For the `kubelet` to get its client certificate, the following conditions should apply:
* the control plane endpoint is healthy (`kube-apiserver` is running)
* the bootstrap manifests were successfully deployed (they include CSR auto-approval)
* `kube-controller-manager` is running
CSR state can be checked with `kubectl get csr`:
```bash
$ kubectl get csr
NAME AGE SIGNERNAME REQUESTOR CONDITION
csr-jcn9j 14m kubernetes.io/kube-apiserver-client-kubelet system:bootstrap:q9pyzr Approved,Issued
csr-p6b9q 14m kubernetes.io/kube-apiserver-client-kubelet system:bootstrap:q9pyzr Approved,Issued
csr-sw6rm 14m kubernetes.io/kube-apiserver-client-kubelet system:bootstrap:q9pyzr Approved,Issued
csr-vlghg 14m kubernetes.io/kube-apiserver-client-kubelet system:bootstrap:q9pyzr Approved,Issued
```
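If a CSR stays in the `Pending` state for a long time (normally the bootstrap manifests auto-approve kubelet CSRs), it can be approved manually; a sketch using one of the names from the listing above:
```bash
# manually approve a stuck kubelet client CSR
$ kubectl certificate approve csr-jcn9j
```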
### Talos prints error `node not ready`
A node in Kubernetes is marked as `Ready` once the CNI is up.
It takes a minute or two for the CNI images to be pulled and for the CNI to start.
If the node is stuck in this state for too long, check the CNI pods and logs with `kubectl`; usually
the CNI resources are created in the `kube-system` namespace.
For example, for the Talos default Flannel CNI:
```bash
$ kubectl -n kube-system get pods
NAME READY STATUS RESTARTS AGE
...
kube-flannel-25drx 1/1 Running 0 23m
kube-flannel-8lmb6 1/1 Running 0 23m
kube-flannel-gl7nx 1/1 Running 0 23m
kube-flannel-jknt9 1/1 Running 0 23m
...
```
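If any of the CNI pods are not `Running`, their logs and events usually explain why; for example (pod name taken from the listing above):
```bash
# inspect a CNI pod that is failing to start
$ kubectl -n kube-system logs kube-flannel-25drx
$ kubectl -n kube-system describe pod kube-flannel-25drx
```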
### Talos prints error `x509: certificate signed by unknown authority`
The full error might look like:
```bash
x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")
```
Commonly, the control plane endpoint points to a different cluster, as the client certificate
generated by Talos doesn't match the CA of the cluster at the control plane endpoint.
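One way to confirm which cluster the endpoint actually serves is to look at the certificate presented on port 6443 (a sketch, assuming `openssl` is available on the workstation):
```bash
# print subject and issuer of the certificate served at the control plane endpoint
$ openssl s_client -connect <endpoint>:6443 -showcerts </dev/null 2>/dev/null | openssl x509 -noout -subject -issuer
```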
### etcd is running on bootstrap node, but stuck in `pre` state on non-bootstrap nodes
Please see the question `etcd is not running on non-bootstrap control plane node` above.
### Checking `kube-controller-manager` and `kube-scheduler`
If the control plane endpoint is up, the status of the pods can be checked with `kubectl`:
```bash
$ kubectl get pods -n kube-system -l k8s-app=kube-controller-manager
NAME READY STATUS RESTARTS AGE
kube-controller-manager-talos-default-master-1 1/1 Running 0 28m
kube-controller-manager-talos-default-master-2 1/1 Running 0 28m
kube-controller-manager-talos-default-master-3 1/1 Running 0 28m
```
If the control plane endpoint is not up yet, the container status can be queried with
`talosctl containers --kubernetes`:
```bash
$ talosctl -n <IP> c -k
NODE NAMESPACE ID IMAGE PID STATUS
...
172.20.0.2 k8s.io kube-system/kube-controller-manager-talos-default-master-1 k8s.gcr.io/pause:3.2 2547 SANDBOX_READY
172.20.0.2 k8s.io └─ kube-system/kube-controller-manager-talos-default-master-1:kube-controller-manager k8s.gcr.io/kube-controller-manager:v1.20.4 2580 CONTAINER_RUNNING
172.20.0.2 k8s.io kube-system/kube-scheduler-talos-default-master-1 k8s.gcr.io/pause:3.2 2638 SANDBOX_READY
172.20.0.2 k8s.io └─ kube-system/kube-scheduler-talos-default-master-1:kube-scheduler k8s.gcr.io/kube-scheduler:v1.20.4 2670 CONTAINER_RUNNING
...
```
If some of the containers are not running, it could be that the image is still being pulled.
Otherwise the process might be crashing; in that case the logs can be checked with `talosctl logs --kubernetes <containerID>`:
```bash
$ talosctl -n <IP> logs -k kube-system/kube-controller-manager-talos-default-master-1:kube-controller-manager
172.20.0.3: 2021-03-09T13:59:34.291667526Z stderr F 2021/03/09 13:59:34 Running command:
172.20.0.3: 2021-03-09T13:59:34.291702262Z stderr F Command env: (log-file=, also-stdout=false, redirect-stderr=true)
172.20.0.3: 2021-03-09T13:59:34.291707121Z stderr F Run from directory:
172.20.0.3: 2021-03-09T13:59:34.291710908Z stderr F Executable path: /usr/local/bin/kube-controller-manager
172.20.0.3: 2021-03-09T13:59:34.291719163Z stderr F Args (comma-delimited): /usr/local/bin/kube-controller-manager,--allocate-node-cidrs=true,--cloud-provider=,--cluster-cidr=10.244.0.0/16,--service-cluster-ip-range=10.96.0.0/12,--cluster-signing-cert-file=/system/secrets/kubernetes/kube-controller-manager/ca.crt,--cluster-signing-key-file=/system/secrets/kubernetes/kube-controller-manager/ca.key,--configure-cloud-routes=false,--kubeconfig=/system/secrets/kubernetes/kube-controller-manager/kubeconfig,--leader-elect=true,--root-ca-file=/system/secrets/kubernetes/kube-controller-manager/ca.crt,--service-account-private-key-file=/system/secrets/kubernetes/kube-controller-manager/service-account.key,--profiling=false
172.20.0.3: 2021-03-09T13:59:34.293870359Z stderr F 2021/03/09 13:59:34 Now listening for interrupts
172.20.0.3: 2021-03-09T13:59:34.761113762Z stdout F I0309 13:59:34.760982 10 serving.go:331] Generated self-signed cert in-memory
...
```
### Checking controller runtime logs
Talos runs a set of controllers which operate on resources to build and support the Kubernetes control plane.
Some debugging information can be queried from the controller logs with `talosctl logs controller-runtime`:
```bash
$ talosctl -n <IP> logs controller-runtime
172.20.0.2: 2021/03/09 13:57:11 secrets.EtcdController: controller starting
172.20.0.2: 2021/03/09 13:57:11 config.MachineTypeController: controller starting
172.20.0.2: 2021/03/09 13:57:11 k8s.ManifestApplyController: controller starting
172.20.0.2: 2021/03/09 13:57:11 v1alpha1.BootstrapStatusController: controller starting
172.20.0.2: 2021/03/09 13:57:11 v1alpha1.TimeStatusController: controller starting
...
```
Controllers run a reconcile loop, so they might be starting, failing and restarting; that is expected behavior.
Things to look for:
* `v1alpha1.BootstrapStatusController: bootkube initialized status not found`: the control plane is not self-hosted, it is running with static pods.
* `k8s.KubeletStaticPodController: writing static pod "/etc/kubernetes/manifests/talos-kube-apiserver.yaml"`: static pod definitions were rendered successfully.
* `k8s.ManifestApplyController: controller failed: error creating mapping for object /v1/Secret/bootstrap-token-q9pyzr: an error on the server ("") has prevented the request from succeeding`: the control plane endpoint is not up yet, bootstrap manifests can't be injected, the controller is going to retry.
* `k8s.KubeletStaticPodController: controller failed: error refreshing pod status: error fetching pod status: an error on the server ("Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)") has prevented the request from succeeding`: the kubelet hasn't been able to contact `kube-apiserver` yet to push the pod status, the controller is going to retry.
* `k8s.ManifestApplyController: created rbac.authorization.k8s.io/v1/ClusterRole/psp:privileged`: one of the bootstrap manifests got successfully applied.
* `secrets.KubernetesController: controller failed: missing cluster.aggregatorCA secret`: Talos is running with a 0.8 configuration; if the cluster was upgraded from 0.8, this is expected, and the conversion process will fix the machine config automatically. If the cluster was bootstrapped with version 0.9, the machine configuration should be regenerated with the 0.9 `talosctl`.
If there are no new messages in the `controller-runtime` log, it means that the controllers finished reconciling successfully.
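The log can also be streamed while reproducing an issue; the `-f` (follow) flag keeps the connection open:
```bash
# follow controller runtime logs in real time
$ talosctl -n <IP> logs controller-runtime -f
```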
### Checking static pod definitions
Talos generates static pod definitions for `kube-apiserver`, `kube-controller-manager`, and `kube-scheduler`
components based on machine configuration.
These definitions can be checked as resources with `talosctl get staticpods`:
```bash
$ talosctl -n <IP> get staticpods -o yaml
node: 172.20.0.2
metadata:
    namespace: controlplane
    type: StaticPods.kubernetes.talos.dev
    id: kube-apiserver
    version: 2
    phase: running
    finalizers:
        - k8s.StaticPodStatus("kube-apiserver")
spec:
    apiVersion: v1
    kind: Pod
    metadata:
        annotations:
            talos.dev/config-version: "1"
            talos.dev/secrets-version: "1"
        creationTimestamp: null
        labels:
            k8s-app: kube-apiserver
            tier: control-plane
        name: kube-apiserver
        namespace: kube-system
...
```
The status of the static pods can be queried with `talosctl get staticpodstatus`:
```bash
$ talosctl -n <IP> get staticpodstatus
NODE NAMESPACE TYPE ID VERSION READY
172.20.0.2 controlplane StaticPodStatus kube-system/kube-apiserver-talos-default-master-1 1 True
172.20.0.2 controlplane StaticPodStatus kube-system/kube-controller-manager-talos-default-master-1 1 True
172.20.0.2 controlplane StaticPodStatus kube-system/kube-scheduler-talos-default-master-1 1 True
```
The most important status is `Ready`, printed as the last column; the complete status can be fetched by adding the `-o yaml` flag.
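For example, to see the full status of a single static pod (resource ID taken from the listing above):
```bash
# detailed status of one static pod, including container statuses and conditions
$ talosctl -n <IP> get staticpodstatus kube-system/kube-apiserver-talos-default-master-1 -o yaml
```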
### Checking bootstrap manifests
As part of the bootstrap process, Talos injects bootstrap manifests into the Kubernetes API server.
There are two kinds of manifests: system manifests built into Talos and extra manifests downloaded by Talos (custom CNI, extra manifests from the machine config):
```bash
$ talosctl -n <IP> get manifests --namespace=controlplane
NODE NAMESPACE TYPE ID VERSION
172.20.0.2 controlplane Manifest 00-kubelet-bootstrapping-token 1
172.20.0.2 controlplane Manifest 01-csr-approver-role-binding 1
172.20.0.2 controlplane Manifest 01-csr-node-bootstrap 1
172.20.0.2 controlplane Manifest 01-csr-renewal-role-binding 1
172.20.0.2 controlplane Manifest 02-kube-system-sa-role-binding 1
172.20.0.2 controlplane Manifest 03-default-pod-security-policy 1
172.20.0.2 controlplane Manifest 10-kube-proxy 1
172.20.0.2 controlplane Manifest 11-core-dns 1
172.20.0.2 controlplane Manifest 11-core-dns-svc 1
172.20.0.2 controlplane Manifest 11-kube-config-in-cluster 1
```
```bash
$ talosctl -n <IP> get manifests --namespace=extras
NODE NAMESPACE TYPE ID VERSION
172.20.0.2 extras Manifest 05-https://docs.projectcalico.org/manifests/calico.yaml 1
```
Details of each manifest can be queried by adding `-o yaml`:
```bash
$ talosctl -n <IP> get manifests 01-csr-approver-role-binding --namespace=controlplane -o yaml
node: 172.20.0.2
metadata:
    namespace: controlplane
    type: Manifests.kubernetes.talos.dev
    id: 01-csr-approver-role-binding
    version: 1
    phase: running
spec:
    - apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRoleBinding
      metadata:
        name: system-bootstrap-approve-node-client-csr
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: ClusterRole
        name: system:certificates.k8s.io:certificatesigningrequests:nodeclient
      subjects:
        - apiGroup: rbac.authorization.k8s.io
          kind: Group
          name: system:bootstrappers
### Worker node is stuck with `apid` health check failures
Control plane nodes have enough secret material to generate `apid` server certificates, but worker nodes
depend on the control plane `trustd` services to issue certificates.
Worker nodes wait for the `kubelet` to join the cluster; then `apid` queries the Kubernetes endpoints via the control plane
endpoint to find `trustd` endpoints, and uses `trustd` to issue the certificate.
So if the `apid` health checks are failing on a worker node (see the commands below):
* make sure the control plane endpoint is healthy
* check that the worker node `kubelet` has joined the cluster
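Both checks can be done with commands already shown in this guide; a sketch (the worker IP is a placeholder):
```bash
# apid health events on the worker node
$ talosctl -n <workerIP> service apid
# the worker should appear in the node list once its kubelet has joined
$ kubectl get nodes
```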