1091 Commits

Author SHA1 Message Date
Andrew Rynhard
34eb691f81 fix: mount extra disks after system disk
The extra disks functionality was completely broken. One fundamental
issue was that we were attempting to create and mount the partitions
before the system disk was created. This moves the extra disks tasks to
the correct part of the boot sequence. This also adds a simple check that
refuses to operate on a disk if any partitions are found.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-11 07:59:25 -08:00
Brad Beam
531e7d8144 feat: Add meminfo api
Add ability to retrieve node memory stats ( /proc/meminfo ).

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-11-10 21:02:43 -06:00
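As an illustration of the meminfo commit above, here is a minimal sketch of parsing text in the `/proc/meminfo` format. The function name `parseMeminfo` and the map-based result are assumptions made for this example; they are not taken from the actual API added in the commit.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseMeminfo (hypothetical helper) parses text in the /proc/meminfo
// format, e.g. "MemTotal:  16384256 kB", into a map from field name to
// the numeric value as reported. The unit suffix, if any, is ignored.
func parseMeminfo(data string) (map[string]uint64, error) {
	stats := make(map[string]uint64)
	scanner := bufio.NewScanner(strings.NewReader(data))
	for scanner.Scan() {
		key, rest, ok := strings.Cut(scanner.Text(), ":")
		if !ok {
			continue // not a "Key: value" line
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			continue // key without a value
		}
		value, err := strconv.ParseUint(fields[0], 10, 64)
		if err != nil {
			return nil, fmt.Errorf("parse %q: %w", key, err)
		}
		stats[strings.TrimSpace(key)] = value
	}
	return stats, scanner.Err()
}

func main() {
	sample := "MemTotal:       16384256 kB\nMemFree:         1234567 kB\nHugePages_Total:       0\n"
	stats, err := parseMeminfo(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(stats["MemTotal"], stats["MemFree"], stats["HugePages_Total"])
}
```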
Andrew Rynhard
90fd52ad8c docs: fix roadmap layout
This adds margins to the roadmap to make it centered like the docs.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-10 15:59:51 -08:00
Andrew Rynhard
8795271c65 docs: update landing page
This updates our note on our commitment to staying in lockstep with
Kubernetes and Linux.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-10 15:39:30 -08:00
Brad Beam
8988c1c6a0 feat: Disable networkd configuration if ip kernel parameter is specified
This allows the kernel argument `ip` to take precedence over networking configuration. Documentation for
this parameter can be found here https://www.kernel.org/doc/Documentation/filesystems/nfs/nfsroot.txt

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-11-10 12:07:01 -08:00
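The precedence rule in the networkd commit above depends on detecting an `ip=` argument on the kernel command line. A minimal sketch of that detection follows; `kernelIPParam` is a hypothetical name, and how networkd actually reads the command line is not shown in the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// kernelIPParam (hypothetical helper) scans a kernel command line, such as
// the contents of /proc/cmdline, and returns the value of the first `ip=`
// argument. The boolean is false when no such argument is present.
func kernelIPParam(cmdline string) (string, bool) {
	for _, arg := range strings.Fields(cmdline) {
		if strings.HasPrefix(arg, "ip=") {
			return strings.TrimPrefix(arg, "ip="), true
		}
	}
	return "", false
}

func main() {
	value, found := kernelIPParam("console=ttyS0 ip=dhcp page_poison=1")
	if found {
		// In this sketch, a caller would skip its own network
		// configuration when the parameter is present.
		fmt.Println("kernel ip parameter set:", value)
	}
}
```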
Andrew Rynhard
83ccbb1d2a docs: add public roadmap
This adds the first pass of our public roadmap. It is intended to be a
living document.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-09 05:04:16 -08:00
Andrey Smirnov
b3fd85174a fix: remove duplicate line
Just remove duplicate line (to satisfy commit message linter).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-11-08 16:58:46 -08:00
Andrey Smirnov
add4a8d5ab fix: recover from panics in grpc servers
This installs default middleware to recover from panics (convert them to
errors) in all the grpc servers by default.

Slight refactoring to allow that, as grpc accepts Unary/Stream
interceptors only once.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-11-08 15:28:18 -08:00
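The panic-recovery commit above converts panics into errors. The sketch below shows that pattern with only the standard library. The `handler` type is a hypothetical stand-in for an RPC handler; real gRPC interceptors have a different signature, and this is not the middleware the commit installs.

```go
package main

import "fmt"

// handler is a hypothetical stand-in for an RPC handler function.
type handler func(req string) (string, error)

// withRecovery wraps h so that a panic raised inside it is recovered and
// returned to the caller as an ordinary error instead of crashing the server.
func withRecovery(h handler) handler {
	return func(req string) (resp string, err error) {
		defer func() {
			if r := recover(); r != nil {
				resp = ""
				err = fmt.Errorf("recovered from panic: %v", r)
			}
		}()
		return h(req)
	}
}

func main() {
	faulty := func(req string) (string, error) {
		panic("unexpected nil config")
	}
	_, err := withRecovery(faulty)("ping")
	fmt.Println("handler returned error:", err)
}
```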
Andrey Smirnov
6231b7db3c chore: run gofumports after protoc-gen
This fixes import order and guarantees a clean diff after `make generate`.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-11-09 00:20:45 +03:00
Brad Beam
7897374ff1 feat: Add support for streaming apis in apid
This brings in the recent updates to protoc-gen-proxy to allow support
for proxying streaming api requests. We artificially limit it to only the first
target specified in the list while we work through what multi target stream
support looks like.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-11-08 14:22:30 -06:00
Spencer Smith
6d5bbaf7c8 chore: re-enable e2e for aws clusters
This PR adds in the necessary manifests and fixes to deploy aws clusters
as part of e2e testing.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2019-11-07 15:32:14 -05:00
Brad Beam
32fe6297fe feat(networkd): Add support for custom nameservers
This adds support for specifying nameservers in the config.

When I was adding tests I noticed the netconf code for setting
the MTU caused a panic. Given how we retrieve the data ( device centric )
in the static addressing method, I think this is safe to remove.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-11-07 13:57:02 -06:00
Andrey Smirnov
8fdf71789e test: add 'integration-test' to e2e runs
Also refactored `integration-test` build as a generic step to be shared
by basic-integration and e2e-integration steps.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-11-07 06:30:34 -08:00
Andrew Rynhard
85638f5d90 fix: pass x509 options to NewCertificateFromCSR
This ensures that certificates are generated with the supplied options.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-06 19:43:56 -08:00
Andrey Smirnov
cdda81df66 test: add k8s integration tests
Once again, mostly groundwork and one simple test for node versions.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-11-06 17:08:44 -08:00
Brad Beam
6519c575f8 feat: Add support for setting container output to stdout
This allows the config.Debug setting to control container output to allow better troubleshooting.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-11-06 10:15:49 -06:00
Andrey Smirnov
27235b9ae1 chore: add simple health check for etcd service
Fixes #1419

This is required to avoid later startup failures while trying to connect
to etcd if it hasn't actually bootstrapped.

This health check performs only a connectivity check, with no quorum/leader checks,
as they should depend on cluster state in general.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-11-06 05:50:40 -08:00
Andrey Smirnov
e2d9cc5438 fix: remove global variable in bootkube
Just a small nit: as all the services share the same package, a global
variable with generic name might lead to fun collisions.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-11-06 05:50:25 -08:00
Andrew Rynhard
8ca4d49347 fix: conditionally create a new etcd cluster
This fixes a long standing issue with upgrading the init node. We
currently have no way of knowing whether the init node should join an
existing etcd cluster, or create a new one. This makes use of the node's
metadata to determine if the node has already created the etcd cluster.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-05 19:05:02 -08:00
Andrew Rynhard
17cce5468f feat: add metadata file to boot partition
This introduces the notion of metadata for a node. In this initial pass
there are only two fields. A timestamp to indicate when the install was
performed, and a field to indicate if the install was performed as part
of an upgrade.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-05 17:59:45 -08:00
Andrey Smirnov
551fa45d33 test: add CLI integration test
This starts with a very simple test for `osctl version` using regexps as
output of the command depends a lot on current version.

We might use more of 'gold' matches for other commands potentially.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-11-05 17:59:23 -08:00
Spencer Smith
ce7a0e36cc chore: re-enable e2e testing
This PR will re-enable e2e testing by using the new cluster api
bootstrap provider and various infra providers.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2019-11-05 16:53:38 -05:00
Brad Beam
988acfee51 docs: Add machine.env section
Adds information about supported environment variables.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-11-05 12:41:49 -08:00
Andrew Rynhard
7419281fa5 chore: prepare release v0.3.0-alpha.6
This is the official v0.3.0-alpha.6 release.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-05 11:21:49 -08:00
Brad Beam
4b3cc34ab0 fix: Disable support for proxy variables for apid.
Since APId/gRPC connections should never go through a proxy, we will explicitly exclude
these environment variables from apid.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-11-05 10:34:33 -08:00
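One way to exclude proxy settings from a service, as the apid commit above describes, is to filter them out of the environment passed to it. The sketch below assumes the conventional variable names `HTTP_PROXY`, `HTTPS_PROXY`, and `NO_PROXY` (matched case-insensitively); the commit message does not list which variables are excluded, and `withoutProxyVars` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// proxyVars lists the variable names removed by this sketch. The exact set
// is an assumption; the commit message does not enumerate them.
var proxyVars = []string{"HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY"}

// isProxyVar reports whether key names a proxy variable, ignoring case.
func isProxyVar(key string) bool {
	for _, name := range proxyVars {
		if strings.EqualFold(key, name) {
			return true
		}
	}
	return false
}

// withoutProxyVars returns a copy of env (in "KEY=value" form) with all
// proxy variables removed.
func withoutProxyVars(env []string) []string {
	filtered := make([]string, 0, len(env))
	for _, kv := range env {
		key, _, _ := strings.Cut(kv, "=")
		if isProxyVar(key) {
			continue
		}
		filtered = append(filtered, kv)
	}
	return filtered
}

func main() {
	env := []string{"PATH=/bin", "https_proxy=http://proxy.example:3128", "NO_PROXY=localhost"}
	fmt.Println(withoutProxyVars(env))
}
```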
Andrew Rynhard
06009f66c8 fix: sleep in NTP query loop
We need to sleep between successful queries so we don't hit the NTP
server too often.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-05 09:45:53 -08:00
Brad Beam
db00c83207 fix: Add host network namespace to networkd and ntpd
Without the host network namespace, networkd and ntpd didn't work properly. NTP failed to
start up because it couldn't reach the NTP servers, and networkd failed to configure
the interfaces and display interface information.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-11-05 09:45:15 -08:00
Andrey Smirnov
b0aef2cf22 test: add integration test framework
These are just the first steps and the core foundation.

It can be used like:

```
make integration.test
osctl cluster create
build/integration.test -test.v
```

This should run the test against the Docker instance.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-11-05 17:21:38 +03:00
Andrew Rynhard
03a09c2294 refactor: rename Helper to Client
The name helper isn't very good. This renames it to Client. A new func
was also added, NewForConfig, that will allow for the creation of the helper
client from an arbitrary Kubernetes REST config.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-04 19:31:27 -08:00
Andrew Rynhard
c9732458c1 fix: verify that all etcd members are running before upgrading
This verifies that all etcd members are running before performing an
upgrade. Without this we run the risk of destroying the etcd cluster.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-04 18:17:13 -08:00
Andrew Rynhard
33468f4d6a fix: don't use 127.0.0.1 for etcd client
We should use 127.0.0.1 only in special cases (like when bootstrapping
the cluster). There is the potential that the local etcd member is
unhealthy and/or not responsive. This adds a function for creating an etcd
client configured with all control plane node IPs in order to better
handle this case.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-04 16:54:15 -08:00
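The etcd client commit above configures the client with every control plane node IP rather than only 127.0.0.1. A minimal sketch of building such an endpoint list is shown here. Port 2379 is etcd's standard client port; the `https` scheme and the name `etcdEndpoints` are assumptions for this example.

```go
package main

import (
	"fmt"
	"net"
)

// etcdEndpoints (hypothetical helper) turns a list of control plane node
// IPs into etcd client endpoints. net.JoinHostPort adds the brackets that
// IPv6 addresses require.
func etcdEndpoints(ips []string) []string {
	endpoints := make([]string, 0, len(ips))
	for _, ip := range ips {
		endpoints = append(endpoints, "https://"+net.JoinHostPort(ip, "2379"))
	}
	return endpoints
}

func main() {
	// With several endpoints, a client can fall back to another member
	// when the local one is unhealthy or unresponsive.
	fmt.Println(etcdEndpoints([]string{"10.0.0.1", "10.0.0.2", "fd00::3"}))
}
```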
Andrew Rynhard
2febace0a4 chore: remove bind mounts from OSD
Now that the APIs have been moved, we no longer need these bind mounts.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-04 15:27:18 -08:00
Andrew Rynhard
a82ed0c5b7 fix: add etcd member conditionally
We should add an etcd member only if it has not already been added. When
a control plane node is rebooted, or down for whatever reason, when it
comes back up it will attempt to add itself again. When it does so, the
cluster is unhealthy due to the fact that the node was down. A feature
of etcd called "strict-reconfig-check" prevents any member adds when the
cluster is unhealthy since doing so would cause the cluster to lose
quorum.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-04 15:24:30 -08:00
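The conditional add described in the commit above amounts to checking the current member list before requesting a member add. The sketch below uses a hypothetical `member` type and matches on peer URL; which field the actual fix compares is not stated in the commit message.

```go
package main

import "fmt"

// member is a hypothetical, simplified view of an etcd cluster member.
type member struct {
	Name     string
	PeerURLs []string
}

// needsMemberAdd reports whether peerURL is absent from every existing
// member, i.e. whether the node still has to add itself to the cluster.
func needsMemberAdd(members []member, peerURL string) bool {
	for _, m := range members {
		for _, u := range m.PeerURLs {
			if u == peerURL {
				return false // already a member; adding again would be rejected
			}
		}
	}
	return true
}

func main() {
	members := []member{
		{Name: "master-1", PeerURLs: []string{"https://10.0.0.1:2380"}},
		{Name: "master-2", PeerURLs: []string{"https://10.0.0.2:2380"}},
	}
	// A rebooted node that is already in the list skips the add.
	fmt.Println(needsMemberAdd(members, "https://10.0.0.2:2380"))
	// A new node is not in the list yet and must be added.
	fmt.Println(needsMemberAdd(members, "https://10.0.0.3:2380"))
}
```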
Brad Beam
41a4741bca refactor: Move logs to machined
This moves Logs endpoint to machined to reduce the mount footprint of osd.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-11-04 15:04:13 -08:00
Brad Beam
a4e1479b07 refactor: Move kubeconfig to machined
This moves the Kubeconfig api endpoint to machined and consolidates the
"read a file" code into machined. This also changes Kubeconfig to
use the CopyOut method which changes Kubeconfig to a streaming grpc call.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-11-04 14:45:23 -08:00
Brad Beam
3fd8abf426 chore: Move data messages to common proto
This allows reuse across multiple apis.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-11-04 14:24:41 -06:00
Andrew Rynhard
18f5c50a32 fix: stop etcd and remove data-dir
We need to stop etcd earlier in the upgrade sequence to prevent machined
from trying to restart it after leaving the etcd cluster. We also need
to remove the data-dir since all the data becomes invalid once we leave
the etcd cluster.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-04 11:48:28 -08:00
Andrew Rynhard
8f10462795 fix: use CRI to stop containers
Using the CRI seems to be more dependable in ensuring that we don't hit
EBUSY when trying to reset the system disk after stopping all
containers.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-04 04:55:37 -08:00
Andrew Rynhard
7eb5b6b748 fix: verify system disk not in use
This adds an extra phase to the upgrade sequence that ensures we don't hit
EBUSY when attempting to delete the ephemeral partition. This is crucial
because if we fail to do so, the disk does not have a bootloader and we
effectively destroy the machine. It works by attempting to open the block
device with O_EXCL: if the block device is in use by the system (e.g., mounted),
open() fails with the error EBUSY.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-04 04:46:01 -08:00
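The O_EXCL check described in the commit above can be sketched in Go as follows. `deviceBusy` is a hypothetical helper, not the function added by the commit, and the device path in `main` is only an example. The EBUSY behavior applies to block devices on Linux.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// deviceBusy (hypothetical helper) opens path with O_EXCL and reports
// whether the open failed with EBUSY. On Linux, an exclusive open of a
// block device fails with EBUSY while the device is in use (for example,
// mounted). Any other error is returned to the caller unchanged.
func deviceBusy(path string) (bool, error) {
	f, err := os.OpenFile(path, os.O_RDONLY|os.O_EXCL, 0)
	if err != nil {
		if errors.Is(err, syscall.EBUSY) {
			return true, nil
		}
		return false, err
	}
	return false, f.Close()
}

func main() {
	// /dev/sda is an example path; opening it typically requires root.
	busy, err := deviceBusy("/dev/sda")
	if err != nil {
		fmt.Println("could not check device:", err)
		return
	}
	fmt.Println("device busy:", busy)
}
```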
Andrew Rynhard
f43e42d845 chore: install customization requirements with ONBUILD
There is no need for these packages to be in the base image. This moves
to installing them using ONBUILD.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-03 22:51:05 -08:00
Andrew Rynhard
eb75d1fb47 refactor: use retry package in ntpd
This moves to using the retry package for retrying NTP queries. It also
adds some additional logging that is useful when NTP queries fail.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-03 19:21:39 -08:00
Andrew Rynhard
e9296bed6e fix: retry BLKPG operations
There are cases where we can see EBUSY when attempting to use the BLKPG
ioctl. The recommendation seems to be to retry when this happens.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-03 18:20:54 -08:00
Andrew Rynhard
22f073b32e refactor: unify service stop on upgrade
This simplifies service shutdown tasks. Shutdown and upgrade events now
use the same task.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-03 18:05:46 -08:00
Andrew Rynhard
f411491484 fix: stop leaking file descriptors
This ensures that probed block devices are closed.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-03 17:15:54 -08:00
Andrew Rynhard
e81b3d11a8 feat: output machined logs to /dev/kmsg and file
Since dmesg is not streamed, it becomes difficult to debug issues with
machined. This fixes that by setting up the logging of machined to go to
/dev/kmsg and to a log file.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-03 12:53:13 -08:00
Andrew Rynhard
3887b1e5b6 chore: force overwrite of output file
This adds the force option to gzip.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-03 10:31:17 -08:00
Andrew Rynhard
eb0c8e9e4b refactor: use constants.SystemContainerdNamespace
This replaces hardcoded instances of "system" with
constants.SystemContainerdNamespace.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-03 10:14:43 -08:00
Andrew Rynhard
3ce6f34995 feat: add timestamp to installed file
This adds a timestamp to /boot/installed. It can be useful for
determining the last known successful install.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-03 10:06:26 -08:00
Andrew Rynhard
45a3406fba fix: send SIGKILL to hanging containers
This addresses an issue caused by containers that refuse to exit with
SIGTERM. After sending SIGTERM, we send SIGKILL after a timeout of one minute.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-03 09:44:49 -08:00
Andrew Rynhard
d15e226998 fix: be explicit about installs
Trying to be smart about whether or not an install is being performed
as part of an upgrade has proven to be error prone. This moves to
performing installs with explicit args.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-11-03 09:31:14 -08:00