Commit Graph

263 Commits

Author SHA1 Message Date
Andrew Rynhard
d4770d41ad feat: run installs via container
This moves to performing installs via a container.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-27 15:01:20 -05:00
Spencer Smith
739e232896 feat: upgrade kubernetes to v1.16.0-beta.1
This PR will upgrade to the latest beta of v1.16 in order to get us
closer to catching the v1.16.0 release as soon as it drops.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2019-08-27 13:25:33 -04:00
Brad Beam
f028d29d31 chore: Increase timers for healthchecks
We've seen some instances where the initial delay is not long enough (containerd).
In addition, a check period of one second inflates the log size for services like
proxyd, which log incoming connections.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-08-27 09:54:05 -07:00
Andrew Rynhard
0bdaff1a90 feat: perform upgrades via container
This moves to performing upgrades via a container.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-27 09:44:50 -07:00
Spencer Smith
f85750cdca feat: generate and use v1 machine configs
This PR will implement the v1 machine config proposal. This will allow
for a streamlined config for talos nodes.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2019-08-26 19:36:14 -04:00
Andrew Rynhard
43e20217e8 feat: add ability to pass data on event bus
We need to support eventing with associated data. This moves the event
bus to an observer design pattern that allows observers to register for
specific events, and to receive the associated data.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-26 13:27:02 -07:00
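
The observer pattern described in 43e20217e8 can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names `EventType`, `Event`, `Observer`, `Bus`, and `recorder` are all hypothetical, and delivery is assumed to be synchronous.

```go
package main

import (
	"fmt"
	"sync"
)

// EventType identifies a kind of event (hypothetical names for illustration).
type EventType int

const (
	Shutdown EventType = iota
	Upgrade
)

// Event carries a type plus arbitrary associated data.
type Event struct {
	Type EventType
	Data interface{}
}

// Observer receives the events it has registered for.
type Observer interface {
	Notify(Event)
}

// Bus fans events out to the observers registered for each event type.
type Bus struct {
	mu        sync.RWMutex
	observers map[EventType][]Observer
}

// NewBus returns an empty bus.
func NewBus() *Bus {
	return &Bus{observers: make(map[EventType][]Observer)}
}

// Register subscribes o to every listed event type.
func (b *Bus) Register(o Observer, types ...EventType) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for _, t := range types {
		b.observers[t] = append(b.observers[t], o)
	}
}

// Publish delivers e synchronously to the observers registered for e.Type.
// Observers must not call Register from Notify in this simple sketch.
func (b *Bus) Publish(e Event) {
	b.mu.RLock()
	defer b.mu.RUnlock()
	for _, o := range b.observers[e.Type] {
		o.Notify(e)
	}
}

// recorder is a trivial observer that stores the data it receives.
type recorder struct{ seen []interface{} }

func (r *recorder) Notify(e Event) { r.seen = append(r.seen, e.Data) }

func main() {
	bus := NewBus()
	r := &recorder{}
	bus.Register(r, Upgrade)
	bus.Publish(Event{Type: Shutdown, Data: "ignored"})
	bus.Publish(Event{Type: Upgrade, Data: "example-payload"})
	fmt.Println(r.seen) // only the Upgrade payload was recorded
}
```

Registering per event type keeps observers from having to filter events they do not care about.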
Spencer Smith
6f8e089271 chore: use kubeadm v1beta2 structs everywhere
This PR will move to using the external kubeadm v1beta2 structs for our
code base. This will hopefully allow for more stable integrations with
kubeadm in the long term, as well as solve some needs we have in the
machine config rewrite.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2019-08-26 12:07:36 -04:00
Brad Beam
692571bdec feat(networkd): Add grpc endpoint
Allows us to list routes and interface details

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-08-25 19:48:08 -07:00
Brad Beam
d36007fb29 feat(osd): Add ntpd client
Allows us to access ntp api

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-08-25 13:38:34 -07:00
Andrew Rynhard
9eaa2d8140 feat: add sequencer interface
This adds an interface that can be used to describe boot, shutdown, and
upgrade events in a set of phases.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-25 12:59:42 -07:00
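
As a rough sketch of the idea in 9eaa2d8140, an interface can expose each lifecycle event as an ordered list of phases. The shapes below (`Phase`, `Sequencer`, `RunPhases`, `demoSequencer`, `step`, `errBoom`) are assumptions made for illustration and are not taken from the project.

```go
package main

import (
	"errors"
	"fmt"
)

// Phase is one named step in a sequence (hypothetical shape).
type Phase struct {
	Name string
	Run  func() error
}

// Sequencer describes boot, shutdown, and upgrade as ordered sets of phases.
type Sequencer interface {
	Boot() []Phase
	Shutdown() []Phase
	Upgrade() []Phase
}

// RunPhases executes phases in order and stops at the first failure.
func RunPhases(phases []Phase) error {
	for _, p := range phases {
		if err := p.Run(); err != nil {
			return fmt.Errorf("phase %q: %w", p.Name, err)
		}
	}
	return nil
}

// errBoom is a placeholder failure used by the demo upgrade sequence.
var errBoom = errors.New("boom")

// step builds a phase that records its name in trace and returns err.
func step(name string, trace *[]string, err error) Phase {
	return Phase{Name: name, Run: func() error {
		*trace = append(*trace, name)
		return err
	}}
}

// demoSequencer is a toy implementation with made-up phase names.
type demoSequencer struct{ trace *[]string }

func (d demoSequencer) Boot() []Phase {
	return []Phase{step("mount", d.trace, nil), step("services", d.trace, nil)}
}

func (d demoSequencer) Shutdown() []Phase {
	return []Phase{step("stop", d.trace, nil)}
}

func (d demoSequencer) Upgrade() []Phase {
	return []Phase{step("fetch", d.trace, errBoom), step("install", d.trace, nil)}
}

func main() {
	var trace []string
	var s Sequencer = demoSequencer{trace: &trace}
	fmt.Println(RunPhases(s.Boot()), trace)
	fmt.Println(RunPhases(s.Upgrade()), trace)
}
```

Keeping the sequences as data makes it possible to reuse one runner for all three events.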
Andrew Rynhard
be8f58c15d feat: add overlay task
This adds a well defined task for handling all overlay mount points that
are required by the system.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-25 10:47:54 -07:00
Andrew Rynhard
1eb02875c2 feat: use BLKPG ioctl for partition events
This moves to using BLKPG ioctl instead of BLKRRPART. BLKRRPART is older
and more sensitive to EBUSY errors. BLKPG has the potential to minimize
the chances of encountering an EBUSY error when manipulating partition
tables.

In looking at a comparison between BLKPG and BLKRRPART, it seems that
both have their pros and cons. Eventually a combination of the two may
serve us better, but for now I think BLKPG will get us further.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-25 07:55:24 -07:00
Brad Beam
cdc989ddda refactor(networkd): Switch from rtnetlink to rtnl
Gives a better abstraction on rtnetlink interaction

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-08-21 13:24:51 -05:00
Brad Beam
313c118ad0 refactor(networkd): Replace networkd with a standalone app
This is a major rewrite of our network subsystem.

- This changes networkd to run as a standalone app versus internal goroutine
- This replaces the netlink package with the more idiomatic netlink/rtnetlink
  packages
- This changes the initial network bootstrap/discovery from using a single
  interface to attempting to bring up all interfaces
- This moves us back on to the upstream dhcp library

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-08-21 13:24:51 -05:00
Andrew Rynhard
0af1eba159 refactor: add more runtime modes
In order to DRY up all installation methods and mount methods, this PR
introduces a few more runtime modes. The modes are then used to
determine the strategy for creating and/or mounting the partitions.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-19 20:23:45 -07:00
Andrew Rynhard
794c7231f5 feat: run dedicated instance of containerd for system services
In order to facilitate upgrades and resets that are capable of
manipulating the system block device, we need to run an instance of
containerd that has zero dependencies on the disk. We run containerd
purely in memory for running system services.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-19 12:32:59 -07:00
Andrew Rynhard
2e65cff3ce feat: mount /sys/fs/bpf
The BPF filesystem is required to pin BPF objects.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-18 07:37:08 -07:00
Brad Beam
ec0f188309 fix(machined): Remove host mounts for specific CNI providers
We shouldn't need these anymore.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-08-17 20:20:45 -07:00
Andrew Rynhard
e305acac20 feat: add standardized command runner
This adds a command runner function that can be used everywhere we need
to exec a binary. It adds additional logic around error handling that
will allow for viewing errors in the case of a failed command.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-17 03:38:36 -07:00
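
One way to build a command runner like the one described in e305acac20 is to capture stderr and attach it to the returned error. This is a sketch under that assumption; the function name `Run` and its signature are hypothetical, and only standard `os/exec` calls are used.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
)

// Run executes name with args. On failure it wraps the error together with
// the captured stderr, so the caller can see why the command failed.
func Run(name string, args ...string) (string, error) {
	var stdout, stderr bytes.Buffer

	cmd := exec.Command(name, args...)
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr

	if err := cmd.Run(); err != nil {
		return stdout.String(), fmt.Errorf("%s %v failed: %w: %s", name, args, err, stderr.String())
	}

	return stdout.String(), nil
}

func main() {
	// The binary name is deliberately one that should not exist.
	if _, err := Run("hypothetical-missing-binary"); err != nil {
		fmt.Println("error:", err)
	}
}
```

Routing every exec through a single helper keeps error reporting consistent across call sites.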
Brad Beam
cf64847772 refactor(proxyd): Update multilisteners to use error chan.
This cleans up the multiple listener implementation.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-08-16 12:21:02 -05:00
Andrew Rynhard
6940aaf233 fix: verify installation definition
This fixes the possibility of panicking on a nil pointer by running the
verification steps earlier.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-16 09:58:12 -07:00
Brad Beam
76a9c15044 feat: Add gRPC server for ntp
Part of the API refactor; this introduces a gRPC server for ntp.
This allows the ability to query node time and check time against
specific ntp servers.

This refactor also moves the ntp functionality into a sub package for
better project organization.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-08-16 09:46:43 -07:00
Brad Beam
46c283b6c9 chore: Disable rate limited kmessage
This should allow for better troubleshooting during early boot/startup

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-08-16 09:36:52 -07:00
Brad Beam
70a478895f feat(proxyd): Add gRPC server
Part of the API refactor; this introduces a gRPC server for proxyd
to expose some of the internal state.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-08-15 16:35:03 -05:00
Andrew Rynhard
a116145c1b feat: rename DATA partition to EPHEMERAL
This changes the data partition name to something more appropriate. We
chose ephemeral to make it very clear that the disk should not be used
for application data.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-15 08:00:22 -07:00
Brad Beam
249acda74a feat: Allow hostname to be specified in userdata
This sets up the ability to define hostname via userdata. I don't expect
this will get used publicly much, but provides a mechanism to convey
the hostname from various sources internally.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-08-14 22:41:27 -05:00
Andrew Rynhard
09693a26c9 chore: update go modules to use Kubernetes v1.16.0-alpha.3
This is not ideal, but it works. We essentially need to start using
replace statements in order to pull in the modules we need.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-14 15:34:09 -07:00
Andrew Rynhard
142500ce3e fix(proxyd): print bootstrap backend dial errors
This prints any error that occurs when dialing the bootstrap backend.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-11 15:12:09 -07:00
Seán C McCord
fd76d90028 fix(proxyd): do not pre-bracket IPv6 backend addrs
Fixes #996

Signed-off-by: Seán C McCord <ulexus@gmail.com>
2019-08-11 15:00:22 -07:00
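
To illustrate why pre-bracketing an IPv6 address (fd76d90028) can be a problem, consider what happens if the address is later combined with a port using the standard library's `net.JoinHostPort`. Whether proxyd joins addresses this way is an assumption here; `BackendAddr` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net"
)

// BackendAddr joins a bare host (no brackets) with a port.
// net.JoinHostPort adds brackets itself when the host contains a colon.
func BackendAddr(host, port string) string {
	return net.JoinHostPort(host, port)
}

func main() {
	fmt.Println(BackendAddr("10.0.0.1", "6443"))      // 10.0.0.1:6443
	fmt.Println(BackendAddr("2001:db8::1", "6443"))   // [2001:db8::1]:6443
	fmt.Println(BackendAddr("[2001:db8::1]", "6443")) // [[2001:db8::1]]:6443 (double-bracketed, invalid)
}
```

Passing the bare address and letting the join step add brackets avoids the double-bracketed form.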
Andrew Rynhard
ad79e8dfcf feat: remove the machine config on reset
This will remove the machine config on a reset so that a new machine
config will be downloaded and used on a reboot.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-11 12:51:55 -07:00
Seán C McCord
63cfd8a405 fix(proxyd): wrap Dial addresses
Handle IPv6 addresses in proxyd frontend.

Fixes #988

Signed-off-by: Seán C McCord <ulexus@gmail.com>
2019-08-10 23:00:28 -07:00
Seán C McCord
7691bb060c fix: enable IPv6 forwarding
Fixes #985

Signed-off-by: Seán C McCord <ulexus@gmail.com>
2019-08-10 22:39:56 -07:00
Seán C McCord
6d22744eca fix: store PartitionName when on NVMe disk
Fixes #978

Signed-off-by: Seán C McCord <ulexus@gmail.com>
2019-08-10 17:10:01 -07:00
Andrey Smirnov
ae54f7e40d fix: stalls in local Docker cluster boot
The problem was triggered by the udevd trigger. The root cause is not clear,
but the workaround is to disable it in container mode.

Implement CPU/mem limits for `osctl cluster create`, apply defaults,
bump defaults for cicd.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-08-10 13:31:47 +03:00
Brad Beam
da1f73249f fix(machined): Clean up installation process
This also includes a fix for #955, which had the unintended side effect
of breaking image creation (since it would always attempt to grow the
filesystem).

The refactor standardizes around looking for the DATA and ESP labels to
discover any existing installations/filesystems. If none are found, an
installation will proceed -- for both image creation and bare metal.
During bootup, the DATA partition will always attempt to expand/grow.

This also introduces a new phase to verify the installation through the
existence of /boot/installed (migrated from the install stage).

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-08-08 22:10:14 -05:00
Brad Beam
53b1330c44 fix(initramfs): Allow data partition to grow
This fix ensures that we always grow the data partition during an installation.

Signed-off-by: Brad Beam <brad.beam@talos-systems.com>
2019-08-07 09:11:02 -05:00
Andrey Smirnov
80f2d62958 chore: stabilize one more health test
Same approach: attempt more retries to fight general slowness/resource
starvation.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-08-06 02:45:00 +03:00
Andrey Smirnov
2f0698def2 chore: stabilize health test
It was failing randomly because a fixed Sleep was not always long enough
for the desired condition to be reached.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-08-02 14:04:03 -07:00
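
A common alternative to a fixed `Sleep` in tests like the one fixed in 2f0698def2 is to poll for the condition with retries until a timeout. The helper below is a minimal sketch; `WaitFor` and `ErrTimeout` are hypothetical names and not part of the project.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrTimeout is returned when the condition never became true.
var ErrTimeout = errors.New("condition not met before timeout")

// WaitFor polls cond every interval until it returns true or timeout elapses.
// The condition is always checked at least once.
func WaitFor(cond func() bool, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return nil
		}
		if time.Now().After(deadline) {
			return ErrTimeout
		}
		time.Sleep(interval)
	}
}

func main() {
	attempts := 0
	err := WaitFor(func() bool {
		attempts++
		return attempts >= 3 // becomes true on the third check
	}, time.Millisecond, time.Second)
	fmt.Println(err, attempts)
}
```

The test then waits only as long as it needs to on a fast machine, while still tolerating slow or resource-starved runs.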
Andrey Smirnov
8362f58e7a chore: fix data race in goroutine runner
Discovered with `go test -race`:

```
WARNING: DATA RACE
Read at 0x00c0000cf2f8 by goroutine 25:
  github.com/talos-systems/talos/internal/app/machined/pkg/system/runner/goroutine.(*goroutineRunner).Stop()
      /home/smira/Documents/autonomy/talos/internal/app/machined/pkg/system/runner/goroutine/goroutine.go:111 +0x3e
  github.com/talos-systems/talos/internal/app/machined/pkg/system/runner/goroutine_test.(*GoroutineSuite).TestStop()
      /home/smira/Documents/autonomy/talos/internal/app/machined/pkg/system/runner/goroutine/goroutine_test.go:115 +0x345
  runtime.call32()
      /usr/local/go/src/runtime/asm_amd64.s:519 +0x3a
  reflect.Value.Call()
      /usr/local/go/src/reflect/value.go:308 +0xc0
  github.com/stretchr/testify/suite.Run.func2()
      /home/smira/Documents/go/pkg/mod/github.com/stretchr/testify@v1.3.1-0.20190311161405-34c6fa2dc709/suite/suite.go:133 +0x2ec
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:865 +0x163

Previous write at 0x00c0000cf2f8 by goroutine 26:
  github.com/talos-systems/talos/internal/app/machined/pkg/system/runner/goroutine.(*goroutineRunner).Run()
      /home/smira/Documents/autonomy/talos/internal/app/machined/pkg/system/runner/goroutine/goroutine.go:65 +0xcb
  github.com/talos-systems/talos/internal/app/machined/pkg/system/runner/goroutine_test.(*GoroutineSuite).TestStop.func3()
      /home/smira/Documents/autonomy/talos/internal/app/machined/pkg/system/runner/goroutine/goroutine_test.go:104 +0x4a
```

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-08-02 14:03:18 -07:00
Andrey Smirnov
37c1703f06 chore: add tests for event.Bus
Small tests to make sure code works as expected.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-08-02 14:02:18 -07:00
Andrey Smirnov
71640662e0 chore(init): rearrange phase handling to push shutdown to main
This re-arranges phases a bit so that shutdown actions are pushed back
to the top-level main.go of machined.

A small, rudimentary event.Bus is introduced to facilitate event passing
(shutdown/restart) between various machined components and main.go. This
might not be the best implementation, just something to allow this
message passing without global variables or such.

The machined API was refactored to run as a goroutine service.

ACPI and signal handlers were rebuilt as phase tasks and are activated for
non-container and container modes, respectively.

As part of the fix, now `docker stop` triggers correct shutdown of Talos
(not a big deal, but good for testing).

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-08-02 08:42:12 -07:00
Andrew Rynhard
90c91807bd refactor: restructure the project layout
This change moves packages into more appropriate places.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-01 22:19:42 -07:00
Andrew Rynhard
a9c4a95a4b fix: mount the owned partitions in cloud platforms
This adds the logic for mounting the owned block device and resizing the
ephemeral partition for cloud platforms.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-01 21:48:23 -07:00
Andrew Rynhard
ca35b85300 refactor: improve installation reliability
This change aims to make installations more unified and reliable. It
introduces the concept of a mountpoint manager that is capable of
mounting, unmounting, and moving a set of mountpoints in the correct
order.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-01 11:44:40 -07:00
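
The ordering concern behind a mountpoint manager (ca35b85300) can be sketched without touching the kernel. The example assumes that "correct order" means parents are mounted before children and unmounted after them; `Mountpoint`, `Manager`, and the method names are hypothetical, and no real mount syscalls are made.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Mountpoint pairs a source device with a target path (hypothetical shape).
type Mountpoint struct {
	Source string
	Target string
}

// Manager keeps mountpoints sorted so parents come before children.
type Manager struct {
	points []Mountpoint
}

// depth counts the non-empty components of a path: "/" is 0, "/var/lib" is 2.
func depth(p string) int {
	return len(strings.FieldsFunc(p, func(r rune) bool { return r == '/' }))
}

// NewManager copies the given mountpoints and sorts them by target depth.
func NewManager(points ...Mountpoint) *Manager {
	sorted := append([]Mountpoint(nil), points...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return depth(sorted[i].Target) < depth(sorted[j].Target)
	})
	return &Manager{points: sorted}
}

// MountOrder returns targets in the order they should be mounted.
func (m *Manager) MountOrder() []string {
	out := make([]string, 0, len(m.points))
	for _, p := range m.points {
		out = append(out, p.Target)
	}
	return out
}

// UnmountOrder returns targets in reverse mount order.
func (m *Manager) UnmountOrder() []string {
	order := m.MountOrder()
	for i, j := 0, len(order)-1; i < j; i, j = i+1, j-1 {
		order[i], order[j] = order[j], order[i]
	}
	return order
}

func main() {
	m := NewManager(
		Mountpoint{Source: "/dev/sda3", Target: "/var/lib"},
		Mountpoint{Source: "/dev/sda1", Target: "/"},
		Mountpoint{Source: "/dev/sda2", Target: "/var"},
	)
	fmt.Println(m.MountOrder())
	fmt.Println(m.UnmountOrder())
}
```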
Andrey Smirnov
9c63f4ed0a feat(init): implement complete API for service lifecycle (start/stop)
It is now possible to `start`/`stop`/`restart` any service via `osctl`
commands.

There are some changes in `ServiceRunner` to support re-use (re-entering
running state). `Services` singleton now tracks service running state to
avoid calling `Start()` on already running `ServiceRunner` instance.
Method `Start()` was renamed to `LoadAndStart()` to break up service
loading (adding to the list of services) and actual service start.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-08-01 11:16:57 -07:00
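
The split between loading and starting described in 9c63f4ed0a might look roughly like the sketch below. The method name `LoadAndStart` comes from the commit message; everything else (the `Service` interface, the `Services` fields, the `counter` type) is a simplified assumption for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Service is a minimal stand-in for a runnable service.
type Service interface {
	ID() string
	Start() error
	Stop() error
}

// Services tracks loaded services and whether each one is running.
type Services struct {
	mu      sync.Mutex
	loaded  map[string]Service
	running map[string]bool
}

// NewServices returns an empty registry.
func NewServices() *Services {
	return &Services{loaded: map[string]Service{}, running: map[string]bool{}}
}

// Load adds svc to the list of services without starting it.
func (s *Services) Load(svc Service) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.loaded[svc.ID()]; !ok {
		s.loaded[svc.ID()] = svc
	}
}

// Start starts a loaded service unless it is already running.
func (s *Services) Start(id string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	svc, ok := s.loaded[id]
	if !ok {
		return fmt.Errorf("service %q not loaded", id)
	}
	if s.running[id] {
		return nil // already running: do not call Start() again
	}
	if err := svc.Start(); err != nil {
		return err
	}
	s.running[id] = true
	return nil
}

// Stop stops a running service and clears its running state.
func (s *Services) Stop(id string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	svc, ok := s.loaded[id]
	if !ok || !s.running[id] {
		return nil
	}
	s.running[id] = false
	return svc.Stop()
}

// LoadAndStart combines Load and Start.
func (s *Services) LoadAndStart(svc Service) error {
	s.Load(svc)
	return s.Start(svc.ID())
}

// counter is a toy service that counts how often it was started.
type counter struct {
	id     string
	starts int
}

func (c *counter) ID() string   { return c.id }
func (c *counter) Start() error { c.starts++; return nil }
func (c *counter) Stop() error  { return nil }

func main() {
	svcs := NewServices()
	c := &counter{id: "demo"}
	_ = svcs.LoadAndStart(c)
	_ = svcs.LoadAndStart(c) // second call is a no-op
	fmt.Println(c.starts)
}
```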
Andrew Rynhard
835d72b74a fix: create overlay mounts after install
Without running the install task first, /var is read-only. This causes
the overlay phase to fail as it tries to create /var/system.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-08-01 06:35:12 -07:00
Andrey Smirnov
084378ac04 fix(init): flip concurrency of tasks/services, fix small issues
Phases should run sequentially, while the tasks within a phase run concurrently.

There are two potential issues fixed:

1. `result` multierror was updated inside a goroutine without any
synchronization, which is a data race
2. a panic inside a task/phase runner might happen, and as an unhandled panic
in a goroutine aborts the whole process, this might lead to a system halt as
'machined' exits

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-07-31 14:21:07 -07:00
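
The two fixes in 084378ac04 can be illustrated with a small sketch: phases run one after another, tasks within a phase run concurrently, error collection is guarded by a mutex, and a panic inside a task is converted into an error. A plain error slice stands in for the multierror mentioned in the commit, and all type and function names here are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// Task is a unit of work inside a phase.
type Task func() error

// Phase groups tasks that may run concurrently.
type Phase struct {
	Name  string
	Tasks []Task
}

// runTask runs t and converts a panic into an ordinary error, so a
// misbehaving task cannot abort the whole process.
func runTask(t Task) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("task panicked: %v", r)
		}
	}()
	return t()
}

// RunPhase runs all tasks of p concurrently and returns their errors.
func RunPhase(p Phase) []error {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex // guards errs, which is written from many goroutines
		errs []error
	)
	for _, t := range p.Tasks {
		wg.Add(1)
		go func(t Task) {
			defer wg.Done()
			if err := runTask(t); err != nil {
				mu.Lock()
				errs = append(errs, err)
				mu.Unlock()
			}
		}(t)
	}
	wg.Wait()
	return errs
}

// RunPhases runs phases sequentially and stops at the first failing phase.
func RunPhases(phases []Phase) error {
	for _, p := range phases {
		if errs := RunPhase(p); len(errs) > 0 {
			return fmt.Errorf("phase %q failed with %d error(s): %v", p.Name, len(errs), errs)
		}
	}
	return nil
}

func main() {
	phases := []Phase{
		{Name: "first", Tasks: []Task{
			func() error { return nil },
			func() error { panic("boom") },
		}},
		{Name: "second", Tasks: []Task{func() error { return nil }}},
	}
	fmt.Println(RunPhases(phases))
}
```

Recovering inside each goroutine matters because `recover` only works in the goroutine that panicked.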
Spencer Smith
bc5fe085bd fix: set mtu value regardless of interface state
This PR will fix a bug we encountered in GCE, where the interface was
already up and the MTU value wasn't getting set.

Signed-off-by: Spencer Smith <robertspencersmith@gmail.com>
2019-07-31 15:02:02 -04:00
Andrew Rynhard
e63c882b89 refactor: split machined into phases
This change aims to standardize the boot process. It introduces the
concept of a phase, which is comprised of tasks. Phases are run in serial and
the tasks that make up a phase are run concurrently.

Signed-off-by: Andrew Rynhard <andrew@andrewrynhard.com>
2019-07-29 12:40:03 -07:00
Andrey Smirnov
f56a9d5b96 chore: implement first version of CRI runner
It runs containers via CRI interface in a pod sandbox. This is the very
first version: I tried not to introduce any changes to common runner
interface.

There should be some CRI-specific options for the runner (like polling
interval, as it doesn't have nice `Wait()` API), plus my plan so far is
to use OCI as the common layer for container options, so that we can
analyze OCI and translate to CRI (when possible, return errors when
option is not implemented).

CRI interface doesn't have a concept of 'unpacking' an image, so we
probably need to unpack via containerd API (or any other
runtime-specific API) by targeting CRI namespace.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
2019-07-26 21:07:46 +03:00