in this package we provide API functions, thus we want to activate
the pve-api-update trigger, so that packages like pve-manager get
notified about it. But we also use API functions directly, so we set
up an interest in the pve-api-update trigger. This results in a
lintian error (with the lintian version from buster or newer), which
we can override:
> [...]
> This tag is also triggered if the package has an activate trigger
> for something on which it also declares an interest. The only (but
> rather unlikely) reason to do this is if another package also
> declares an interest and this package needs to activate that other
> package. If the package is using it for this exact purpose, then
> please use a Lintian override to state this.
-- https://lintian.debian.org/tags/repeated-trigger-name.html
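For illustration, the trigger declarations and the override could look
roughly like this (the -noawait variants and the exact override context
are assumptions, not necessarily the shipped files):

    # debian/triggers (sketch)
    activate-noawait pve-api-update
    interest-noawait pve-api-update

    # debian/pve-ha-manager.lintian-overrides (sketch)
    pve-ha-manager: repeated-trigger-name pve-api-update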
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
This call was missed in the commit moving it from
PVE::HA::Tools to PVE::HA::Config.
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Fixes: 0087839aa530 ("Tools: remove dependency on PVE::Cluster")
Allow an admin to set a datacenter-wide HA policy which can change
the way we handle services on a node shutdown.
There are three policies:
* freeze: always freeze services, independent of the shutdown type
  (reboot, poweroff)
* failover: never freeze services, this means that a service will get
  recovered to another node if possible and if the current node does
  not come back up within the grace period of 1 minute.
* default: this is the current behavior, freeze on reboot but do not
  freeze on poweroff
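A minimal Perl sketch of that decision logic (sub and variable names
are illustrative only, not the actual implementation):

    # decide whether to freeze services for a given node shutdown
    sub should_freeze_services {
        my ($shutdown_policy, $is_reboot) = @_;

        return 1 if $shutdown_policy eq 'freeze';   # always freeze
        return 0 if $shutdown_policy eq 'failover'; # never freeze, allow recovery
        # 'default': freeze on reboot, but not on poweroff
        return $is_reboot ? 1 : 0;
    }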
Add two tests: shutdown-policy1, which is based on the reboot1 test
but enforces no freeze with a failover policy, and shutdown-policy2,
which is based on the shutdown1 test but with an explicit freeze
policy. You can compare (diff) each test's log result to the test
it's based on to see what changes.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
use dpkg-buildpackage and debhelper properly, add missing
dependencies, and embed the used Perl modules from libpve-common-perl
to make pve-ha-simulator standalone.
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
by moving parse_sid to PVE::HA::Env, with the default implementation in
PVE::HA::Config.
the bash completion methods use PVE::HA::Config (and PVE::Cluster), but
the corresponding use statements are only in PVE::CLI::ha_manager, where the
bash completion is actually used.
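A short sketch of the delegation this results in (following the usual
PVE::HA::Env pattern of forwarding to the environment plugin; the body
is assumed, not copied from the actual source):

    # PVE::HA::Env forwards to the concrete environment (PVE2, test, sim)
    sub parse_sid {
        my ($self, $sid) = @_;
        return $self->{plug}->parse_sid($sid);
    }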
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
to avoid an unnecessary dependency on PVE::Cluster in PVE::HA::Tools.
reading the LRM status file was the only instance of reading from the
CFS via this method.
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
and use PVE::HA::Groups to parse the config when testing/simulating.
this allows us to drop the dependency on PVE::HA::Config, which would
otherwise pull in a lot of additional dependencies that we don't want
in the simulator.
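For example, the simulator/test code can then parse a groups config
along these lines (assuming the usual PVE::SectionConfig-style
interface of PVE::HA::Groups; the file handling shown is illustrative):

    use PVE::HA::Groups;

    my $filename = 'groups';  # example path only
    my $raw = '';
    if (-e $filename) {
        open(my $fh, '<', $filename) or die "can't open '$filename': $!\n";
        local $/; # slurp the whole file
        $raw = <$fh>;
    }
    my $groups_cfg = PVE::HA::Groups->parse_config($filename, $raw);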
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
since we want to test the version from the current working tree, and not
the installed one.
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
We ignored whether the cluster state update failed and happily worked
with an empty state, resulting in strange actions, e.g., the removal
of all (not so) "stale" services or changing the node state of all
but the master to unknown.
Check the update result and, if it failed, either do not become
active, or, if already active, skip the current round with the
knowledge that we only got here because the update failed but our
lock renew worked => the cfs is already back in a working and quorate
state (probably just a restart).
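A rough sketch of the intended flow (structure and names are
illustrative; the state update method name follows the related
environment change):

    sub do_one_iteration_sketch {  # hypothetical wrapper, not the real loop
        my ($self) = @_;
        my $haenv = $self->{haenv};

        if (!$haenv->cluster_update_state()) {
            # no consistent cluster state: do not become active, or, if we
            # are already active, skip this round; the lock renew worked,
            # so the cfs should be usable again next round
            return;
        }

        # ... regular state transition / manage work would follow here ...
    }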
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
We updated the CRM and LRM view of the cluster state only in the PVE2
environment, outside of all regression testing and simulation scope.
Further, we ignored whether this update failed and happily worked
with an empty state, resulting in strange actions, e.g., the removal
of all (not so) "stale" services or changing the node state of all
but the master to unknown.
This patch tries to improve this by moving the update out into its
own environment method, cluster_update_state, calling it in the LRM
and CRM and saving its result.
As with our introduced functionality to simulate cfs rw or update
errors, we can also simulate failures of this state update with the
RT system.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
We called them at similar times anyway, and with this change they are
now covered by the regression tests.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
Mainly addresses a problem where we read the manager status without
catching any possible exceptions.
This was done only to check if our node has active fencing jobs,
which tells us that it makes no sense to even try to acquire the
manager lock, as we're going to be fenced soon anyway.
Besides this check, we always checked if we're quorate and if there
are services configured, so move both checks into the new
'can_get_active' method, which replaces the check_pending_fencing
and the has_services methods.
Move the quorum check in front and catch a possible error from the
following manager status read.
As a side effect the state transition code gets a bit shorter without
obscuring the intent of the checks.
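A compact sketch of how such a combined check could look (the quorum,
manager status and service config calls follow the environment API;
the pending-fencing condition is deliberately simplified):

    sub can_get_active {
        my ($self) = @_;
        my $haenv = $self->{haenv};

        # quorum first, nothing below is meaningful without it
        return 0 if !$haenv->quorate();

        # the manager status read may throw (e.g., while pmxcfs restarts);
        # treat that as "cannot get active right now"
        my $ms = eval { $haenv->read_manager_status() };
        return 0 if $@;

        # if our node is being fenced there's no point in acquiring the
        # manager lock (simplified condition, the real check differs)
        my $node = $haenv->nodename();
        return 0 if ($ms->{node_status}->{$node} // '') eq 'fence';

        # only become active if any services are configured at all
        return scalar(keys %{$haenv->read_service_config()}) > 0;
    }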
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
we may get an error here if the cluster filesystem is (temporarily)
unavailable. This error resulted in stopping the whole CRM service
immediately, which then triggered a node reset (if it happened on the
current master), even if we still had time left to retry and thus,
for example, handle an update of pve-cluster gracefully.
Add a method which wraps the status read in an eval and logs a
possible error, but does not abort the service. Instead we rely on
our get_protected_ha_agent_lock method to detect a problem and switch
to the lost_agent_lock state.
If the pmxcfs outage was really short, so that the manager status
read failed but the lock update worked again, we also always update
before doing real work when in the 'active' state. If this update
fails we return from the eval and try again next round, as there is
no point in doing anything without a consistent state.
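A sketch of such a wrapping method (the method name is made up for
illustration):

    # read the manager status, but log instead of dying on a cfs hiccup
    sub read_manager_status_no_fail {  # hypothetical name
        my ($haenv) = @_;

        my $status = eval { $haenv->read_manager_status() };
        if (my $err = $@) {
            # do not abort the CRM; a real problem will surface through
            # get_protected_ha_agent_lock and the lost_agent_lock state
            $haenv->log('err', "could not read manager status: $err");
            return undef;
        }
        return $status;
    }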
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
Add simulated hardware commands for the cluster file system.
This allows us to tell the regression test or simulator system that a
certain node's calls to methods accessing the CFS should fail, i.e.,
die.
With this we can cover a situation which mainly happens during a
cluster file system update.
For now, allow defining whether the CFS is read-/writable (state rw)
and whether updates of the CFS (state update) should work or fail.
Add 'can read/write' assertions all over the relevant methods.
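As an example, a regression test command list could then toggle a
node's CFS state roughly like this (the exact
"cfs <node> <rw|update> <work|fail>" syntax and the surrounding
commands are assumptions derived from the description above):

    [
        [ "power node1 on", "power node2 on", "power node3 on" ],
        [ "cfs node3 rw fail" ],
        [ "delay 100" ],
        [ "cfs node3 rw work" ]
    ]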
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
This was introduced for cleaning up a possible left-over systemd
watchdog mux enable link, which is gone for good now.
Then it was extended with trigger targets, as the HA Manager services
now restart when the pve-api-update trigger fires.
As the auto-generated postinst already does the same unconditionally
for pve-ha-lrm.service and pve-ha-crm.service, we may remove it too.
The only difference is that the auto-generated script uses
try-restart, not reload-or-try-restart, but this does not matter, as
the HA services currently have no reload ability.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
This was copied by accident when adding the transitional code for
removing the left-overs of the systemd-managed watchdog mux in
commit f8a3fc80af299e613c21c9b67e29aee8cc807018
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
This transitional code was added first with
commit f8a3fc80af299e613c21c9b67e29aee8cc807018
and fixed up with
commit ecc145c9724f056549e5458f17d7714ac8c83459
during Proxmox VE 4.1 and 4.2 to remove the problematic
systemd-managed watchdog mux socket.
As each system going for a distribution upgrade must first upgrade to
4.4, where this gets handled, we can remove it now.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
as it's a static unit, starting it via dh_systemd_start is not
possible - but it gets pulled in and started by pve-ha-crm/pve-ha-lrm
anyway.
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
otherwise it gets confused and enables pve-ha-crm twice in the postinst.
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Wrap those calls to the cfs_read_file method, which may now also die
if there was a grave problem reading the file, into an eval in all
methods which are used by the HA services.
The ones only used by API calls or CLI helpers are not wrapped, as
there it can be handled more gracefully (i.e., no watchdog is
running). Further, this is intended more as a temporary workaround
until we handle such an exception explicitly in the services, which
is a somewhat bigger change, so let's just go back to the old
behavior for now.
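A sketch of the eval-wrapping described above (the file name variable
is a placeholder; the empty-hash fallback mirrors the old
pre-exception behavior):

    my $conf = eval { cfs_read_file($config_filename) };
    if (my $err = $@) {
        # a watchdog may already be armed here, so do not die; fall back
        # to an empty config as before and let the caller carry on
        warn "reading the HA config failed: $err";
        $conf = {};
    }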
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>