mirror of git://git.proxmox.com/git/pve-ha-manager.git synced 2025-01-20 18:03:53 +03:00

730 Commits

Author SHA1 Message Date
Thomas Lamprecht
800a2de6a3 FenceConfig: early return if file is empty
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-01-13 12:45:45 +01:00
Thomas Lamprecht
1c2561110f d/lintian-overrides: add repeated-trigger-name override
In this package we provide API functions, thus we want to activate
the pve-api-update trigger, so that packages like pve-manager get
notified about it. But we also use API functions directly, so we set up
an interest in the pve-api-update trigger. This results in a lintian
error (lintian version from buster or newer) which we can override:

> [...]
> This tag is also triggered if the package has an activate trigger
> for something on which it also declares an interest. The only (but
> rather unlikely) reason to do this is if another package also
> declares an interest and this package needs to activate that other
> package. If the package is using it for this exact purpose, then
> please use a Lintian override to state this.
-- https://lintian.debian.org/tags/repeated-trigger-name.html

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-01-08 17:35:59 +01:00
Thomas Lamprecht
3220c3391c sim: show sent emails in regression tests
It's good to check whether any regression regarding sendmail happened, as
it can be annoying if a sendmail loop happens.
2019-01-08 17:32:05 +01:00
Thomas Lamprecht
7488b3cc2c fence config: allow to pass arguments to fence agents via short-opts
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-01-08 15:28:06 +01:00
Thomas Lamprecht
a57a3b7809 d/control: add missing pve-container dependency
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-01-08 15:23:56 +01:00
Thomas Lamprecht
7583bf275c fencing: fixup run_fence_jobs
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-01-08 15:23:04 +01:00
Thomas Lamprecht
7655c92c81 fixup changelog line length and typos
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-01-07 13:35:34 +01:00
Thomas Lamprecht
e3e02f4688 bump version 2.0-6
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-01-07 13:00:00 +01:00
Wolfgang Bumiller
0354cbe945 fixup parse_sid call
This call was missed in the commit moving it from
PVE::HA::Tools to PVE::HA::Config.

Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Fixes: 0087839aa530 ("Tools: remove dependency on PVE::Cluster")
2019-01-07 12:10:40 +01:00
Thomas Lamprecht
d2236278ac followup code cleanup
addresses a few nits from Fabian's review at:
https://pve.proxmox.com/pipermail/pve-devel/2018-December/035061.html
https://pve.proxmox.com/pipermail/pve-devel/2018-December/035085.html

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-01-07 12:07:05 +01:00
Thomas Lamprecht
7a20d688d8 lrm: explicitly log shutdown_policy on node shutdown
Makes the regression tests a bit more telling, and it helps to be
verbose for a user here too.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-01-07 11:17:30 +01:00
Thomas Lamprecht
ba15a9b908 fix #1378: allow to specify a service shutdown policy
Allow an admin to set a datacenter wide HA policy which can change
the way we handle services on a node shutdown.

There's:

* freeze: always freeze services, independent of the shutdown type
  (reboot, poweroff)
* failover: never freeze services, this means that a service will get
  recovered to another node if possible and if the current node does
  not come back up in the grace period of 1 minute.
* default: this is the current behavior, freeze on reboot but do not
  freeze on poweroff

Add two tests: shutdown-policy1, which is based on the reboot1 test
but enforces no freeze with a failover policy, and shutdown-policy2,
which is based on the shutdown1 test but with an explicit freeze
policy. You can compare (diff) each test's log result to the test it's
based on to see what changes.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-01-07 11:17:30 +01:00
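The three policies described in ba15a9b908 amount to a small decision table. The sketch below is an illustration only, not the code from the commit: the function name `should_freeze` and the string values for the shutdown type are assumptions made for this example.

```python
def should_freeze(policy: str, shutdown_type: str) -> bool:
    """Return True if services on a node should be frozen when it shuts down.

    Hypothetical helper: `policy` is one of "freeze", "failover" or
    "default"; `shutdown_type` is "reboot" or "poweroff".
    """
    if policy == "freeze":
        # Always freeze, independent of the shutdown type.
        return True
    if policy == "failover":
        # Never freeze, so services can be recovered to another node.
        return False
    if policy == "default":
        # Previous behavior: freeze on reboot, but not on poweroff.
        return shutdown_type == "reboot"
    raise ValueError(f"unknown shutdown policy: {policy!r}")
```

With a policy of "failover" the result is False even for a reboot, which corresponds to what the shutdown-policy1 test is described as enforcing.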
Thomas Lamprecht
ed408b4491 Env: add get_ha_settings method
Add get_ha_settings, a method which returns the datacenter wide HA
settings

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-01-07 11:17:30 +01:00
Rhonda D'Vine
b9350791a3 Add missing Build-Depends
Signed-off-by: Rhonda D'Vine <rhonda@proxmox.com>
2018-12-17 09:41:11 +01:00
Thomas Lamprecht
c974828745 install simulator executable into bin not sbin
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2018-10-17 11:51:04 +02:00
Thomas Lamprecht
1e07d70c29 Tools: add note about indirect include of Config module
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2018-10-17 11:41:44 +02:00
Fabian Grünbichler
728d9a2a97 build: actually ship SOURCE file
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
2018-10-17 11:20:41 +02:00
Fabian Grünbichler
6ea95574cc build: bump compat level to 10
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
2018-10-17 11:20:41 +02:00
Fabian Grünbichler
1116ca25b8 build: restructure packaging
use dpkg-buildpackage and debhelper properly, add missing dependencies and
embed used perl modules from libpve-common-perl to make pve-ha-simulator
standalone.

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
2018-10-17 11:20:41 +02:00
Fabian Grünbichler
0087839aa5 Tools: remove dependency on PVE::Cluster
by moving parse_sid to PVE::HA::Env, with the default implementation in
PVE::HA::Config.

the bash completion methods use PVE::HA::Config (and PVE::Cluster), but
the corresponding use statements are only in PVE::CLI::ha_manager, where the
bash completion is actually used.

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
2018-10-17 11:20:41 +02:00
Fabian Grünbichler
6529b6a4e2 Tools/Config: refactor lrm status json reading
to avoid unnecessary dependency on PVE::Cluster in PVE::HA::Tools.

reading the LRM status file was the only instance of reading from the
CFS via this method.

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
2018-10-17 11:20:41 +02:00
Fabian Grünbichler
5f52cd3c42 sim: don't install PVE::HA::Config
it is not needed anymore by the simulator.

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
2018-09-28 15:26:26 +02:00
Fabian Grünbichler
dd970f9ea6 sim: don't install real resources
they are not needed, the simulator contains its own (simulated)
resources.

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
2018-09-28 15:26:21 +02:00
Fabian Grünbichler
7d33cb12de groups: register groups directly
and use PVE::HA::Groups to parse the config when testing/simulating.

this allows us to drop the dependency on PVE::HA::Config, which would
otherwise pull in a lot of additional dependencies that we don't want
in the simulator.

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
2018-09-28 14:06:59 +02:00
Fabian Grünbichler
f503a7bf77 pve-ha-tester: use correct lib path
since we want to test the version from the current working tree, and not
the installed one.

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
2018-09-28 14:06:59 +02:00
Fabian Grünbichler
e649331eab remove unused use statements
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
2018-09-28 14:06:59 +02:00
Fabian Grünbichler
745fd425c4 build: remove leftover PHONY declaration
simdeb is already declared PHONY on its own

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
2018-09-28 14:06:59 +02:00
Dominik Csapak
2799edd464 document api result for ha resources
Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
2018-09-17 12:43:54 +02:00
Thomas Lamprecht
c253924fd3 bump version to 2.0-5
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2018-02-07 11:20:28 +01:00
Thomas Lamprecht
9cdf16b1c8 buildsys: use correct git revision for SOURCE file
2018-02-07 10:38:51 +01:00
Thomas Lamprecht
724bd3f311 do not do active work if cfs update failed
We ignored a failed cluster state update and happily worked with
an empty state, resulting in strange actions, e.g., the removal of
all (not so) "stale" services or changing the state of all nodes but
the master's to unknown.

Check on the update result and if failed, either do not get active,
or, if already active, skip the current round with the knowledge
that we only got here because the update failed but our lock renew
worked => cfs got already in a working and quorate state again -
(probably just a restart)

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com> 
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
2018-01-30 09:33:16 +01:00
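The rule stated in 724bd3f311 can be summarized as a per-round decision. This is a minimal sketch under assumed names (`decide_round` and its return strings are made up for illustration) and does not reproduce the actual CRM or LRM state machines.

```python
def decide_round(state: str, update_ok: bool) -> str:
    """Decide what a HA service does in one round (hypothetical sketch).

    `state` is the current service state, `update_ok` the result of the
    cluster state update done at the start of the round.
    """
    if not update_ok:
        if state == "active":
            # Lock renewal worked but the update failed: skip this round
            # instead of acting on an empty cluster state.
            return "skip-round"
        # Not active yet: do not become active without a valid state.
        return "stay-inactive"
    return "do-work" if state == "active" else "may-get-active"
```

The point of the sketch is that a failed update never leads to active work, so "stale" service removal cannot be triggered by an empty state.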
Thomas Lamprecht
3df1538094 move cfs update to common code
We updated the CRM and LRM view of the cluster state only in the PVE2
environment, outside of all regression testing and simulation scope.

Further, we ignored a failure of this update and happily worked with an
empty state, resulting in strange actions, e.g., the removal of all
(not so) "stale" services or changing the state of all nodes but the
master's to unknown.

This patch tries to improve this by moving the update out into its own
environment method, cluster_update_state, calling it in the LRM and
CRM and saving its result.
With our newly introduced functionality to simulate cfs rw or update
errors, we can also simulate failures of this state update with the RT
system.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com> 
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
2018-01-30 09:33:16 +01:00
Thomas Lamprecht
da6f041699 move start/end hooks to common code
We called them at similar times anyways, and have them under the
regression test cover with this change.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com> 
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
2018-01-30 09:33:16 +01:00
Thomas Lamprecht
ada4b9a830 Revert "wrap possible problematic cfs_read_file calls in eval"
This reverts commit bf7febe3771d6f9a2aef97bcd6eab4ece098c5aa.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com> 
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
2018-01-30 09:33:16 +01:00
Thomas Lamprecht
30b4f397a0 CRM: refactor check if state transition to active is ok
Mainly addresses a problem where we read the manager status without
catching any possible exceptions.

This read was done only to check if our node has active fencing jobs,
which tells us that it makes no sense to even try to acquire the
manager lock, as we're fenced soon anyway.
Besides this check we always checked if we're quorate and if there
are services configured, so move both checks into the new
'can_get_active' method, which replaces the check_pending_fencing and
the has_services methods.

Move the quorum check in front and catch a possible error from the
following manager status read.
As a side effect the state transition code gets a bit shorter without
hiding the check intention.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com> 
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
2018-01-30 09:33:16 +01:00
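The ordering described in 30b4f397a0 (quorum check first, then a guarded manager status read) could look roughly like the following. This is a sketch with assumed inputs: the parameter names and the `pending_fence_jobs` key are hypothetical and not taken from the real manager status format.

```python
from typing import Callable


def can_get_active(
    quorate: bool,
    read_manager_status: Callable[[], dict],
    node: str,
    service_count: int,
) -> bool:
    """Return True if trying to acquire the manager lock makes sense."""
    # Cheap check first: without quorum there is nothing to do.
    if not quorate:
        return False
    try:
        status = read_manager_status()
    except Exception:
        # A failed status read must not abort the service; just do not
        # become active in this round.
        return False
    # Hypothetical key: nodes that currently have a fencing job pending.
    if node in status.get("pending_fence_jobs", ()):
        return False
    return service_count > 0
```

Putting the quorum check in front means the status read, which is the step that may raise, is only attempted when it can matter.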
Thomas Lamprecht
8e940b68f9 lrm: handle an error during service_status update
We may get an error here if the cluster filesystem is (temporarily)
unavailable. This error resulted in stopping the whole CRM
service immediately, which then triggered a node reset (if it happened
on the current master), even if we still had time left to retry and
thus, for example, handle an update of pve-cluster gracefully.

Add a method which wraps the status read in an eval and logs an
eventual error, but does not abort the service. Instead we rely on
our get_protected_ha_agent_lock method to detect a problem and switch
to the lost_agent_lock state.

If the pmxcfs outage was really short, so that the manager status
read failed but the lock update worked again, we also always update
before doing real work when in the 'active' state. If this update
fails we return from the eval and try again next round, as there is no
point in doing anything without consistent state.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
2018-01-30 09:33:16 +01:00
Thomas Lamprecht
ba2a45cd9d test/sim: allow to simulate cfs failures
Add simulated hardware commands for the cluster file system.

This allows telling the regression test or simulator system that a
certain node's calls to methods accessing the CFS should fail, i.e.,
die.
With this we can cover a situation which mainly happens during a
cluster file system update.

For now allow to define if the CFS is read-/writeable (state rw) and
if updates of the CFS (state update) should work or fail.

Add 'can read/write' assertions all over the relevant methods.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
2018-01-30 09:31:03 +01:00
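The two switches introduced in ba2a45cd9d (state rw and state update) can be pictured as per-node flags that guarded methods check before touching the simulated file system. The class below is a minimal sketch under that assumption; `SimulatedCfs` and all its method names are made up for illustration and are not the simulator's API.

```python
class SimulatedCfs:
    """Per-node failure switches for a simulated cluster file system."""

    def __init__(self, nodes):
        # Both switches default to "working" for every node.
        self._state = {node: {"rw": True, "update": True} for node in nodes}
        self._files = {}

    def set_state(self, node, kind, works):
        """Set the 'rw' or 'update' switch of a node to working or failing."""
        if kind not in ("rw", "update"):
            raise ValueError(f"unknown cfs state: {kind!r}")
        self._state[node][kind] = works

    def _assert_can_rw(self, node):
        # The 'can read/write' assertion: fail loudly, like a die would.
        if not self._state[node]["rw"]:
            raise RuntimeError(f"cfs not readable/writable on {node}")

    def write_file(self, node, path, data):
        self._assert_can_rw(node)
        self._files[path] = data

    def read_file(self, node, path):
        self._assert_can_rw(node)
        return self._files.get(path)

    def update(self, node):
        """Return whether the cluster state update works on this node."""
        return self._state[node]["update"]
```

A test scenario would flip a switch for one node, for example `cfs.set_state("node1", "rw", False)`, and then check that only that node's accesses fail.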
Thomas Lamprecht
3166752f13 postinst: use auto generated postinst
This was introduced for cleaning up a possible leftover systemd
watchdog mux enable link, which is gone for good now.

Then it was extended with trigger targets, as the HA Manager services
now restart when the pve-api-update trigger fires.
As the autogenerated postinst does the same unconditionally for the
pve-ha-lrm.service and pve-ha-crm.service already we may remove it
too.
The only difference is that try-restart is used by the auto generated
script, not reload-or-try-restart, but this does not matter, as the
HA services have currently no reload ability.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2018-01-26 09:37:22 +01:00
Thomas Lamprecht
c122969ff2 postinst: we do not use templates, remove debconf
This was copied by accident when adding the transitional code for
removing the left over of the systemd managed watchdog mux in
commit f8a3fc80af299e613c21c9b67e29aee8cc807018

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2018-01-26 09:37:22 +01:00
Thomas Lamprecht
e2c96fdae4 postinst: drop transitional systemd watchdog mux socket cleanup
This transitional code was added first with
commit f8a3fc80af299e613c21c9b67e29aee8cc807018
and fixed up with
commit ecc145c9724f056549e5458f17d7714ac8c83459
during Proxmox VE 4.1 and 4.2 to remove the problematic systemd
managed watchdog mux socket.

As each system going for a distribution upgrade must first upgrade
to 4.4, where this gets handled, we can remove it now.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2018-01-26 09:37:22 +01:00
Thomas Lamprecht
0da2e042e1 watchdog mux: trailing whitespace cleanup
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2018-01-16 09:10:01 +01:00
Thomas Lamprecht
1dd1d6cd3a watchdog mux: fix comment, there's no systemd .socket anymore
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2018-01-16 09:09:47 +01:00
Thomas Lamprecht
5f09eb480d fix typo in simulator package description
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2018-01-15 13:15:48 +01:00
Fabian Grünbichler
a6b9892808 debian/rules: add some explaining comments
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2017-12-28 16:37:25 +01:00
Fabian Grünbichler
1abfa1f8ec debian/rules: don't dh_systemd_start watchdog-mux
as it's a static unit dh_systemd_starting it is not possible - but it gets
pulled in and started by pve-ha-crm/pve-ha-lrm anyway.

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2017-12-28 16:37:25 +01:00
Fabian Grünbichler
449a03b794 debian/rules: add file names to dh_systemd_enable
otherwise it gets confused and enables pve-ha-crm twice in the postinst.

Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2017-12-28 16:37:25 +01:00
Thomas Lamprecht
cf1ad777ff buildsys: also cleanup *.buildinfo files
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2017-11-16 11:32:58 +01:00
Wolfgang Bumiller
5d82b887eb bump version to 2.0-4
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
2017-11-09 11:47:36 +01:00
Thomas Lamprecht
bf7febe377 wrap possible problematic cfs_read_file calls in eval
Wrap those calls to the cfs_read_file method, which may now also die
if there was a grave problem reading the file, into eval in all
methods which are used by the ha services.

The ones only used by API calls or CLI helpers are not wrapped, as
there it can be handled more gracefully (i.e., no watchdog is
running). Further, this is intended as a temporary workaround
until we handle such an exception explicitly in the services, which
is a somewhat bigger change, so let's just go back to the old behavior
for now.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2017-11-09 11:42:12 +01:00
Thomas Lamprecht
f466005d20 swap native syslog command with HA environment one
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2017-11-08 06:01:46 +01:00