talos

Author	SHA1	Message	Date
Andrew Rynhard	34eb691f81	fix: mount extra disks after system disk The extra disks functionality was completely broken. One fundamental issue was that we were attempting to create and mount the partitions before the system disk was created. This moves the extra disks tasks to the correct part of the boot sequence. This also adds a simple check that refuses to operate on a disk if any partitions are found. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-11 07:59:25 -08:00
Brad Beam	531e7d8144	feat: Add meminfo api Add ability to retrieve node memory stats ( /proc/meminfo ). Signed-off-by: Brad Beam <brad.beam@talos-systems.com>	2019-11-10 21:02:43 -06:00
Andrew Rynhard	90fd52ad8c	docs: fix roadmap layout This adds margins to the roadmap to make it centered like the docs. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-10 15:59:51 -08:00
Andrew Rynhard	8795271c65	docs: update landing page This updates our note on our commitment to staying in lockstep with Kubernetes and Linux. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-10 15:39:30 -08:00
Brad Beam	8988c1c6a0	feat: Disable networkd configuration if `ip` kernel parameter is specified This allows the kernel argument `ip` to take precedence over networking configuration. Documentation for this parameter can be found here https://www.kernel.org/doc/Documentation/filesystems/nfs/nfsroot.txt Signed-off-by: Brad Beam <brad.beam@talos-systems.com>	2019-11-10 12:07:01 -08:00
Andrew Rynhard	83ccbb1d2a	docs: add public roadmap This adds the first pass of our public roadmap. It is intended to be a living document. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-09 05:04:16 -08:00
Andrey Smirnov	b3fd85174a	fix: remove duplicate line Just remove duplicate line (to satisfy commit message linter). Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-11-08 16:58:46 -08:00
Andrey Smirnov	add4a8d5ab	fix: recover from panics in grpc servers This installs default middleware to recover from panics (convert them to errors) in all the grpc servers by default. Slight refactoring to allow that as grpc can only accept Unary/Stream interceptors only once. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-11-08 15:28:18 -08:00
Andrey Smirnov	6231b7db3c	chore: run gofumports after protoc-gen This fixes import order and guarantees a clean diff after `make generate`. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-11-09 00:20:45 +03:00
Brad Beam	7897374ff1	feat: Add support for streaming apis in apid This brings in the recent updates to protoc-gen-proxy to allow support for proxying streaming api requests. We artificially limit it to only the first target specified in the list while we work through what multi target stream support looks like. Signed-off-by: Brad Beam <brad.beam@talos-systems.com>	2019-11-08 14:22:30 -06:00
Spencer Smith	6d5bbaf7c8	chore: re-enable e2e for aws clusters This PR adds in the necessary manifests and fixes to deploy aws clusters as part of e2e testing. Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>	2019-11-07 15:32:14 -05:00
Brad Beam	32fe6297fe	feat(networkd): Add support for custom nameservers This adds support for specifying nameservers in the config. When I was adding tests I noticed the netconf code for setting the MTU caused a panic. Given how we retrieve the data (device centric) in the static addressing method, I think this is safe to remove. Signed-off-by: Brad Beam <brad.beam@talos-systems.com>	2019-11-07 13:57:02 -06:00
Andrey Smirnov	8fdf71789e	test: add 'integration-test' to e2e runs Also refactored `integration-test` build as a generic step to be shared by basic-integration and e2e-integration steps. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-11-07 06:30:34 -08:00
Andrew Rynhard	85638f5d90	fix: pass x509 options to NewCertificateFromCSR This ensures that certificates are generated with the supplied options. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-06 19:43:56 -08:00
Andrey Smirnov	cdda81df66	test: add k8s integration tests Once again, mostly groundwork and one simple test for node versions. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-11-06 17:08:44 -08:00
Brad Beam	6519c575f8	feat: Add support for setting container output to stdout This allows the config.Debug setting to control container output to allow better troubleshooting. Signed-off-by: Brad Beam <brad.beam@talos-systems.com>	2019-11-06 10:15:49 -06:00
Andrey Smirnov	27235b9ae1	chore: add simple health check for etcd service Fixes #1419 This is required to avoid later startup failures while trying to connect to etcd if it hasn't actually bootstrapped. This health check does just connectivity check, no quorum/leader checks, as they should depend on cluster state in general. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-11-06 05:50:40 -08:00
Andrey Smirnov	e2d9cc5438	fix: remove global variable in bootkube Just a small nit, as all the services share same package, global variable with generic name might lead to fun collisions. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-11-06 05:50:25 -08:00
Andrew Rynhard	8ca4d49347	fix: conditionally create a new etcd cluster This fixes a long standing issue with upgrading the init node. We currently have no way of knowing whether the init node should join an existing etcd cluster, or create a new one. This makes use of the node's metadata to determine if the node has already created the etcd cluster. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-05 19:05:02 -08:00
Andrew Rynhard	17cce5468f	feat: add metadata file to boot partition This introduces the notion of metadata for a node. In this initial pass there are only two fields. A timestamp to indicate when the install was performed, and a field to indicate if the install was performed as part of an upgrade. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-05 17:59:45 -08:00
Andrey Smirnov	551fa45d33	test: add CLI integration test This starts with a very simple test for `osctl version` using regexps as output of the command depends a lot on current version. We might use more of 'gold' matches for other commands potentially. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-11-05 17:59:23 -08:00
Spencer Smith	ce7a0e36cc	chore: re-enable e2e testing This PR will re-enable e2e testing by using the new cluster api bootstrap provider and various infra providers. Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>	2019-11-05 16:53:38 -05:00
Brad Beam	988acfee51	docs: Add machine.env section Adds information about supported environment variables. Signed-off-by: Brad Beam <brad.beam@talos-systems.com>	2019-11-05 12:41:49 -08:00
Andrew Rynhard	7419281fa5	chore: prepare release v0.3.0-alpha.6 This is the official v0.3.0-alpha.6 release. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-05 11:21:49 -08:00
Brad Beam	4b3cc34ab0	fix: Disable support for proxy variables for apid. Since APId/gRPC connections should never go through a proxy, we will explicitly exclude these environment variables from apid. Signed-off-by: Brad Beam <brad.beam@talos-systems.com>	2019-11-05 10:34:33 -08:00
Andrew Rynhard	06009f66c8	fix: sleep in NTP query loop We need to sleep between successful queries so we don't hit the NTP server too often. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-05 09:45:53 -08:00
Brad Beam	db00c83207	fix: Add host network namespace to networkd and ntpd Without host network namespace, networkd and ntpd didn't work properly. NTP failed to start up because it couldn't reach the NTP servers, and networkd failed to configure the interfaces and display interface information. Signed-off-by: Brad Beam <brad.beam@talos-systems.com>	2019-11-05 09:45:15 -08:00
Andrey Smirnov	b0aef2cf22	test: add integration test framework This is just first steps and core foundation. It can be used like: ``` make integration.test osctl cluster create build/integration.test -test.v ``` This should run the test against the Docker instance. Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>	2019-11-05 17:21:38 +03:00
Andrew Rynhard	03a09c2294	refactor: rename Helper to Client The name helper isn't very good. This renames it to Client. A new func was also added, NewForConfig, that will allow for the creation of the helper client from an arbitrary Kubernetes REST config. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-04 19:31:27 -08:00
Andrew Rynhard	c9732458c1	fix: verify that all etcd members are running before upgrading This verifies that all etcd members are running before performing an upgrade. Without this we run the risk of destroying the etcd cluster. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-04 18:17:13 -08:00
Andrew Rynhard	33468f4d6a	fix: don't use 127.0.0.1 for etcd client We should use 127.0.0.1 only in special cases (like when bootstrapping the cluster). There is the potential that the local etcd member is unhealthy and/or not responsive. This adds a function for creating an etcd client configured with all control plane node IPs in order to better handle this case. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-04 16:54:15 -08:00
Andrew Rynhard	2febace0a4	chore: remove bind mounts from OSD Now that the APIs have been moved, we no longer need these bind mounts. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-04 15:27:18 -08:00
Andrew Rynhard	a82ed0c5b7	fix: add etcd member conditionally We should add an etcd member only if it has not already been added. When a control plane node is rebooted, or down for whatever reason, when it comes back up it will attempt to add itself again. When it does so, the cluster is unhealthy due to the fact that the node was down. A feature of etcd called "strict-reconfig-check" prevents any member adds when the cluster is unhealthy since doing so would cause the cluster to lose quorum. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-04 15:24:30 -08:00
Brad Beam	41a4741bca	refactor: Move logs to machined This moves Logs endpoint to machined to reduce the mount footprint of osd. Signed-off-by: Brad Beam <brad.beam@talos-systems.com>	2019-11-04 15:04:13 -08:00
Brad Beam	a4e1479b07	refactor: Move kubeconfig to machined This moves the Kubeconfig api endpoint to machined and consolidates the "read a file" code into machined. This also changes Kubeconfig to use the CopyOut method which changes Kubeconfig to a streaming grpc call. Signed-off-by: Brad Beam <brad.beam@talos-systems.com>	2019-11-04 14:45:23 -08:00
Brad Beam	3fd8abf426	chore: Move data messages to common proto This allows reuse across multiple apis. Signed-off-by: Brad Beam <brad.beam@talos-systems.com>	2019-11-04 14:24:41 -06:00
Andrew Rynhard	18f5c50a32	fix: stop etcd and remove data-dir We need to stop etcd earlier in the upgrade sequence to prevent machined from trying to restart it after leaving the etcd cluster. We also need to remove the data-dir since all the data becomes invalid once we leave the etcd cluster. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-04 11:48:28 -08:00
Andrew Rynhard	8f10462795	fix: use CRI to stop containers Using the CRI seems to be more dependable in ensuring that we don't hit EBUSY when trying to reset the system disk after stopping all containers. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-04 04:55:37 -08:00
Andrew Rynhard	7eb5b6b748	fix: verify system disk not in use This adds an extra phase to the upgrade sequence that ensures we don't hit EBUSY when attempting to delete the ephemeral partition. This is crucial because if we fail to do so, the disk does not have a bootloader and we effectively destroy the machine. It works by attempting to open the block device with O_EXCL: If the block device is in use by the system (e.g., mounted) , open() fails with the error EBUSY. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-04 04:46:01 -08:00
Andrew Rynhard	f43e42d845	chore: install customization requirements with ONBUILD There is no need for these packages to be in the base image. This moves to installing them using ONBUILD. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-03 22:51:05 -08:00
Andrew Rynhard	eb75d1fb47	refactor: use retry package in ntpd This moves to using the retry package for retrying NTP queries. It also adds some additional logging that is useful when NTP queries fail. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-03 19:21:39 -08:00
Andrew Rynhard	e9296bed6e	fix: retry BLKPG operations There are cases where we can see EBUSY when attempting to use the BLKPG ioctl. The recommendation seems to be to retry when this happens. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-03 18:20:54 -08:00
Andrew Rynhard	22f073b32e	refactor: unify service stop on upgrade This simplifies service shutdown tasks. Shutdown and upgrade events now use the same task. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-03 18:05:46 -08:00
Andrew Rynhard	f411491484	fix: stop leaking file descriptors This ensures that probed block devices are closed. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-03 17:15:54 -08:00
Andrew Rynhard	e81b3d11a8	feat: output machined logs to /dev/kmsg and file Since dmesg is not streamed, it becomes difficult to debug issues with machined. This fixes that by setting up the logging of machined to go to /dev/kmsg and to a log file. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-03 12:53:13 -08:00
Andrew Rynhard	3887b1e5b6	chore: force overwrite of output file This adds the force option to gzip. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-03 10:31:17 -08:00
Andrew Rynhard	eb0c8e9e4b	refactor: use constants.SystemContainerdNamespace This replaces hardcoded instances of "system" with constants.SystemContainerdNamespace. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-03 10:14:43 -08:00
Andrew Rynhard	3ce6f34995	feat: add timestamp to installed file This adds a timestamp to /boot/installed. It can be useful for determining the last known successful install. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-03 10:06:26 -08:00
Andrew Rynhard	45a3406fba	fix: send SIGKILL to hanging containers This addresses an issue caused by containers that refuse to exit with SIGTERM. After sending SIGTERM, we send SIGKILL after a timeout of one minute. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-03 09:44:49 -08:00
Andrew Rynhard	d15e226998	fix: be explicit about installs Trying to be smart about whether or not an install is being performed as part of an upgrade has proven to be error prone. This moves to performing installs with explicit args. Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>	2019-11-03 09:31:14 -08:00

1091 Commits