269 Commits

Author SHA1 Message Date
Andrey Smirnov
55f3249783 test: use registry mirrors in CI
This relies on registry caching mirrors running in the CI.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-31 16:30:41 +03:00
Andrey Smirnov
58aa2b75bb test: destroy clusters in e2e tests (qemu/firecracker)
As the build runs inside containers which are part of a single pod, we
need to clean up networking bits (bridge interface, etc.), so that it
doesn't cause problems for other steps.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-31 06:21:09 -07:00
Andrey Smirnov
a5d64d97c1 test: update qemu/firecracker provisioners
Fixes #2363 #2364 #2370 #2371

Several changes packed together:

* use compressed `vmlinuz` everywhere, firecracker provisioner
uncompresses it before first use, drop `vmlinux`

* handle reboots in qemu launcher to support reset API case, update
empty disk check to handle reset behavior (erasing partition table)

* make bootloader support the default in provisioners, and add a flag to
disable that

* early support for target architecture for qemu provisioner

This should allow us to use `qemu` in CI/CD (not included into this PR):
integration test passes with qemu.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-30 21:17:25 +03:00
Artem Chernyshev
c6eb18eed5 feat: qemu provisioner
Starts and stops qemu VMs, has some initial configuration subset.
Sets up networking through CNI tools, sets up DHCP server which gives IP
addresses to nodes.

Signed-off-by: Artem Chernyshev <artem.0xD2@gmail.com>
2020-07-28 14:55:35 -07:00
Andrew Rynhard
6f5d24cc3d chore: add release notes
This ensures that releases have notes.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-07-28 14:14:33 -07:00
Andrey Smirnov
6a81f30941 test: provide node discovery for cli tests via kubectl
Fixes #2330

CLI tests require node discovery as the `--nodes` flag is enforced for most
of the `talosctl` commands.

For clusters created via `talosctl cluster create`, cluster provisioner
state provides all the necessary information, but clusters created via
CAPI don't have the state attached.

API tests rely on Talos and Kubernetes APIs to fetch kubeconfig and
access Nodes K8s API.

CLI tests should rely only on CLI tools, so we use `kubectl get nodes` +
`talosctl kubeconfig` to fetch list of master and worker nodes.

This discovery method relies on "bootstrap" node being set in
`talosconfig` (to fetch `kubeconfig`).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-28 11:35:47 -07:00
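
As an illustration of the discovery approach described in the commit above, here is a minimal sketch that splits the JSON output of `kubectl get nodes -o json` into master and worker addresses. This is not the code from the commit: the function name `split_nodes` is hypothetical, and the sketch assumes that control plane nodes carry the `node-role.kubernetes.io/master` label and that each usable node reports an `InternalIP` address.

```python
import json


def split_nodes(kubectl_json: str):
    """Split `kubectl get nodes -o json` output into (masters, workers) IP lists."""
    masters, workers = [], []
    for item in json.loads(kubectl_json).get("items", []):
        labels = item.get("metadata", {}).get("labels", {})
        addresses = item.get("status", {}).get("addresses", [])
        # Assumption: the first InternalIP is the address to pass via --nodes.
        ip = next(
            (a["address"] for a in addresses if a.get("type") == "InternalIP"),
            None,
        )
        if ip is None:
            # Skip nodes that do not report an internal address.
            continue
        # Assumption: control plane nodes are marked with the master role label.
        if "node-role.kubernetes.io/master" in labels:
            masters.append(ip)
        else:
            workers.append(ip)
    return masters, workers
```

The resulting lists could then be passed to `talosctl` through `--nodes`, which is the flag the CLI tests are required to supply.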
Andrey Smirnov
76c44ac468 test: remove apid load balancer for firecracker
We're not using load balancer for `apid` (always using client-side load
balancing), so we can remove this safely.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-28 20:21:21 +03:00
Andrew Rynhard
1f31d24e55 chore: use Kubernetes pipelines
This moves to using Kubernetes pipelines.

Signed-off-by: Andrew Rynhard <andrew@rynhard.io>
2020-07-27 12:09:53 -07:00
Andrey Smirnov
3d8418a689 feat: force nodes to be set in talosctl commands using the API
With load balancing enabled by default, running `talosctl` without
`--nodes` is risky, as the request might hit any control plane node.

Only two commands do not enforce this check, as they do their own node
contexts: `crashdump` and `health` (client-side).

Integration tests were updated to always supply `--nodes` cli argument,
while doing that I refactored the storage for discovered nodes to use
existing `cluster.Info` interface.

The downside is that with e2e CAPI tests CLI tests will be mostly
skipped as we don't support discovery in CLI tests at the moment. This
can be fixed by using `talosctl kubeconfig` + `kubectl get nodes` for
node discovery.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-21 12:17:43 -07:00
Spencer Smith
f290f88160 chore: update clusterctl for CI testing
This PR brings in the latest version of clusterctl that has built-in
support for the talos repos. I'll be chasing this with a move to using
the control-plane provider as well!

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-07-15 19:33:59 -04:00
Andrey Smirnov
9590030a84 feat: print crash dump in talosctl cluster create on failure
When cluster fails to be bootstrapped or it fails the health check, it's
hard to find the root cause without the logs.

This change adds optional crashdump (it dumps firecracker logs or docker
logs) after provisioning failure. It's not enabled by default.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-10 11:54:07 -07:00
Andrey Smirnov
4f5660b22b test: fix sonobuoy delete
It expects kubeconfig as required argument.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-09 18:46:57 +03:00
Spencer Smith
67cddaff44 chore: wait for resource deletion in sonobuoy
This PR fixes the fix where we try to cleanup sonobuoy. We did that
successfully, but still got errors b/c we were immediately trying to
create service accounts in a namespace that was being deleted. This
should fix that. The sonobuoy default wait period is 1hr, should be
plenty.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-07-07 10:58:47 -07:00
Spencer Smith
13bd77355e chore: cleanup sonobuoy after failed attempts
This PR will make sure that, if we're going to retry sonobuoy, we run
the delete command first to clean up any dangling resources.

Closes #2266.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-07-06 11:46:49 -07:00
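
The "delete before retrying" pattern from the commit above can be sketched generically as follows. This is an illustrative assumption rather than the actual CI script: `retry_with_cleanup`, `run`, and `cleanup` are hypothetical names, where `run` would wrap the sonobuoy run and `cleanup` would wrap the sonobuoy delete command.

```python
import time


def retry_with_cleanup(run, cleanup, attempts=3, delay=0.0):
    """Call run() until it returns True, calling cleanup() before each retry."""
    for attempt in range(attempts):
        if attempt > 0:
            # Remove dangling resources left by the previous failed attempt.
            cleanup()
            time.sleep(delay)
        if run():
            return True
    return False
```

Cleanup runs only between attempts, so a first-try success never triggers a delete.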
Andrey Smirnov
b4abab0ed0 test: run integration pipeline nightly
Just a copy of `integration` pipeline with the same trigger as `nightly`
pipeline, so we can have two separate pipelines and two notifications
for better visibility.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-02 08:22:33 -07:00
Andrey Smirnov
0ea9a0a5ea test: run e2e-firecracker-short for default pipeline only
As many pipelines inherit steps from `default_steps`, take out
`e2e-firecracker-short` from `default_steps`.

`e2e` pipeline only relies on `e2e-docker`.

`integration` pipeline does full firecracker run with `e2e-firecracker`.

`release` pipeline manually pulls in `e2e-firecracker-short` to be on
the safe side.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-01 13:16:25 -07:00
Andrey Smirnov
3ae5e0e749 test: add short integration test with custom CNI
This adds a new flag to `cluster create` to launch a cluster with a custom
CNI; the `integration` pipeline gets a new step to run a short test with
Cilium 1.8.0 CNI.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-01 11:19:19 -07:00
Patatman
90acb01a4e docs: digital rebar docs
Digital rebar docs in the guide section.

Signed-off-by: Patatman <git@jeursen.nl>
2020-06-30 18:52:39 -07:00
Andrey Smirnov
e46a09f56a chore: make default pipeline run shorter integration test
This moves the full integration test and provision tests to
the `integration` pipeline.

Docker test wasn't affected much, as anyways docker can't run long
integration tests, so it mostly affects firecracker and provision tests.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-07-01 00:14:55 +03:00
Andrey Smirnov
506c118710 chore: bring back tmp volume shared from e2e-docker to CAPI steps
CAPI-based steps are using docker Talos cluster built at `e2e-docker`
stage as a bootstrap cluster. Share the config via volume which is
attached to specific steps.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-06-29 22:16:05 -07:00
Andrey Smirnov
9cd7fa29f0 chore: stop mounting /tmp for the build pipeline
e2e tests are running in `/tmp/e2e` directory, so all the firecracker VM
virtual disks are going to `/tmp` directory. As `/tmp` was mounted as
`tmpfs`, this was putting high pressure on build host memory. Memory is
also used for docker containers, firecracker VMs, etc. Build host has
fast NVMe disks, so no good reason to keep `/tmp` in memory.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-06-26 14:04:59 -07:00
Andrew Rynhard
d0d2ac3c74 test: default to using the bootstrap API
This moves our test scripts to using the bootstrap API. Some
automation around invoking the bootstrap API was also added
to give the same ease of use when creating clusters with the
CLI.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-06-24 08:46:10 -07:00
Andrew Rynhard
ded76758c8 chore: run provision tests in parallel
This change makes the provision tests run in parallel to the
e2e tests.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-06-16 20:34:17 -07:00
Spencer Smith
e03a68f8eb feat: update k8s and sonobuoy versions
This PR will update k8s to the latest 1.18 release and bump sonobuoy to
help resolve some e2e flakes. Also adds some retry logic around the
sonobuoy run.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-06-10 06:47:36 -07:00
Andrey Smirnov
2fb00344ab chore: upgrade Go to 1.14.3 and use toolchain for race detector
With Go 1.14.3 we can run race-enabled code on muslc, so this opens path
to run unit-tests-race under Talos environment with rootfs, enabling all
the tests to run under race detector.

Also fixed the tests run by specifying platform in the test environment.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-05-25 08:35:11 -07:00
Spencer Smith
6383a78065 chore: serialize firecracker e2e tests
This PR will ensure that the firecracker provision tests will only run
after a successful e2e_firecracker run. This is being added in hopes of
freeing up some resources during CI testing and making things more
stable.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-05-11 12:25:14 -07:00
Andrey Smirnov
23be80fd96 test: stabilize tests by bumping timeouts
Bump timeouts for reset API test as K8s control plane teardown might
take 3 minutes on its own.

Bump Go Firecracker SDK timeout when talking to firecracker process.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-05-06 08:26:18 -07:00
Spencer Smith
c1b6f05b00 chore: use clusterctl and v1alpha3 providers for tests
This PR will update our testing code to make use of the clusterctl tool,
as well as use the newer versions of various providers and updated
manifests.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-05-01 07:42:19 -07:00
Andrew Rynhard
8af77c0f3d release(v0.5.0-alpha.2): prepare release
This is the official v0.5.0-alpha.2 release.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-04-28 09:35:44 -07:00
Andrew Rynhard
3332ca58d3 release(v0.5.0-alpha.1): prepare release
This is the official v0.5.0-alpha.1 release.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-04-21 11:52:29 -07:00
Andrew Rynhard
5f996e737d chore: use a single CHANGELOG
Instead of keeping a CHANGELOG for each release in the master branch, a
single CHANGELOG should be used since it will move into release branches
anyways. This prevents us from having to keep the files in sync across
master and the release branch. This also adds better tooling for
generating the CHANGELOG.md.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-04-17 11:24:48 -07:00
Andrew Rynhard
4ccd4d5364 fix: set ephemeral partition to max size
This sets the size of the ephemeral partition to the maximum
allowed size at installation time. We have reports of `xfs_growfs` causing
extremely slow boot times when the disk is 1TB or more. In our research
we found evidence that `xfs_growfs` is an expensive operation when
growing to a size of 10 times or more of the base. Instead, users should
create the disk close to the max disk size at install time. The
difference being that `mkfs.xfs` will handle larger disks better.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-04-17 07:08:04 -07:00
Spencer Smith
8d2f8d6127 chore: remove random.trust_cpu references
This PR removes the references to adding in the random CPU trust to the
kernel for all v0.4 docs, as well as in the iso command in the
installer. This is no longer needed with the newer linux kernel.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-04-14 17:10:56 -07:00
Andrew Rynhard
a10acd592a chore: address random CI nits
This PR does the following:

- updates the conform config
- cleans up conform scopes
- moves slash commands to the talos-bot
- adds a check list to the pull request template
- disables codecov comments
- uses `BOT_TOKEN` so all actions are performed as the talos-bot user
- adds a `make conformance` target to make it easy for contributors to
check their commit before creating a PR
- bumps golangci-lint to v1.24.0

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-04-13 13:01:14 -07:00
Andrey Smirnov
2d5c6f4c10 test: serialize docs step execution
`make docs` removes and then regenerates contents of some docs, so it
might cause random `-dirty` issue when running concurrently with build
steps.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-04-07 23:46:16 +03:00
Spencer Smith
3a4eaeeef0 feat: upgrade kubernetes to 1.18
This PR will pull in the latest release of k8s 1.18 so we can start
validating it through our test suite.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-03-26 14:59:43 -04:00
Andrey Smirnov
104af4380e feat: make --wait default option to talosctl cluster create
It seems to be useful enough to be the default one and it prevents
simple mistakes while trying to access the cluster which is not ready
yet.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-03-25 06:36:43 -07:00
Andrey Smirnov
e38cde9b48 chore: update upgrade tests for new version, split into two tracks
This updates upgrade tests to run two flows with 3+1 clusters:

1. 0.3 -> current (testing upgrade with partition wiping)
2. 0.4-alpha.7 -> current (testing upgrade without partition wiping,
boot-a/boot-b)

And small upgrade with preserve enabled for single-node cluster.

Provision tests are now split into two parallel tracks in Drone.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-03-24 15:30:00 -07:00
Spencer Smith
3485ea9f09 fix: update k8s to 1.17.3
This PR will update k8s to v1.17.3 to address CVEs mentioned in https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/kubernetes-security-announce/2UOlsba2g0s

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-03-23 17:08:52 -07:00
Andrew Rynhard
c6581fabac feat: build talosctl for ARM v7
This adds an ARM v7 build of `talosctl`.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-03-21 18:35:00 -07:00
Andrew Rynhard
43662e4a24 feat: build talosctl for ARM64
This adds an ARM64 build of `talosctl`.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-03-21 16:40:52 -07:00
Andrew Rynhard
5dbc26c7a3 feat: rename osctl to talosctl
This is a rename of the osctl binary. We decided that talosctl is a
better name for the Talos CLI. This does not break any APIs, but does
make older documentation only accurate for previous versions of Talos.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-03-20 19:07:39 -07:00
Andrey Smirnov
2e3681054d chore: improve handling of etcd responses in bootkube pre-func
Try more attempts, wait for the response. Treat empty response as no
error (as this is what to expect when key is not set yet).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-03-06 21:06:48 +03:00
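
A minimal sketch of the response handling described in the commit above: retry on errors, and treat an empty response as "key not set yet" rather than as a failure. The names `get_key_with_retry` and `fetch` are hypothetical, and the real pre-func talks to etcd through its Go client, so this only illustrates the control flow.

```python
import time


def get_key_with_retry(fetch, attempts=5, delay=0.0):
    """Return the first value from fetch(), or None if the key is not set.

    fetch() is assumed to return a list of values and to raise on errors.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_err = None
    for _ in range(attempts):
        try:
            values = fetch()
        except Exception as err:
            # Transient failure: remember it and try again after a pause.
            last_err = err
            time.sleep(delay)
            continue
        # An empty response is not an error: the key simply is not set yet.
        return values[0] if values else None
    raise last_err
```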
Andrey Smirnov
d5d3035c8c test: enable upgrade tests 0.4.x -> latest
With the fix #1904, it's now possible to upgrade 0.4.x with
`machine.File` extra files (caused by registry mirror for
registry.ci.svc).

Bump resources for upgrade tests in attempt to speed it up.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-02-26 00:09:32 +03:00
Andrey Smirnov
923ef4537b test: implement new class of tests: provision tests (upgrades)
This class of tests is included/excluded by build tags, but as it is
pretty different from other integration tests, we build it as separate
executable. Provision tests provision cluster for the test run, perform
some actions and verify results (could be upgrade, reset, scale up/down,
etc.)

There's now a framework to implement upgrade tests. The first test
upgrades from the latest 0.3 (0.3.2 at the moment) to the current version
of Talos (being built in CI). The test starts by booting with the 0.3
kernel/initramfs, runs the 0.3 installer to install a 0.3.2 cluster, waits
for bootstrap, and then upgrades to 0.4 in rolling fashion. As Firecracker
supports a bootloader, this boots the 0.4 system from the boot disk (as
installed by the installer).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-02-21 07:04:03 -08:00
Andrey Smirnov
5f330f1f64 chore: push installer & talos images to the CI registry on every build
This enables a way to run the matching installer image in firecracker
tests. New image is used in firecracker tests and bootloader support to
use installed kernel/initramfs, which opens path for upgrade tests.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-02-18 07:32:45 -08:00
Andrew Rynhard
c9a8605f87 chore: move golangci-lint.yaml to .golangci.yml
This allows local runs of golangci-lint to use the default config path.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2020-02-18 07:10:21 -08:00
Andrey Smirnov
f51e9a14fe chore: build app container images skipping export to host
Container images for `apid`, `networkd`, etc. are now built inside the
buildkit using the `img` tool. This means that all the dependencies are
now controlled in `buildkit` and many more stages can run in parallel
without problems (overwriting content in `_out/images`).

This also simplifies Drone configuration, as we can let buildkit handle
the dependencies. I also enabled more stages to run in parallel.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2020-02-14 13:17:25 -08:00
Spencer Smith
1d73a9e6d1 chore: only run ok-to-test when PR
This PR fixes a quick bug in CI where the ok-to-test step in drone was
running after a merge to master.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-02-04 10:27:46 -08:00
Spencer Smith
c825b83d47 chore: support slash commands in drone
This PR adds the necessary drone step to check for the `ok-to-test`
label before running any testing against a PR.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2020-02-04 12:57:16 -05:00