It's not much, but it is repeated a few times, and as the next commit will
add yet another instance, let's just refactor it into a local private helper
with a very explicit name and a comment about what implications calling
it has.
Take the chance and add some more safety comments too.
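A minimal sketch of the pattern meant here, with a purely hypothetical helper name and cache layout, just to illustrate the "explicit name plus warning comment" idea:

    use strict;
    use warnings;

    # Hypothetical private helper: the explicit name and the comment spell
    # out the side effect every call site needs to be aware of.
    # NOTE: this throws away the cached state for $sid, so only call it
    # when the on-disk status is authoritative again.
    my $forget_cached_service_state = sub {
        my ($self, $sid) = @_;
        delete $self->{cache}->{$sid};
    };

    # call sites then use the helper instead of repeating the raw delete:
    my $state = { cache => { 'vm:100' => { status => 'started' } } };
    $forget_cached_service_state->($state, 'vm:100');
    print exists $state->{cache}->{'vm:100'} ? "still cached\n" : "cache dropped\n";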
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
This basically makes recovery just another active state transition, as can
be seen from the regression tests - no other semantic change is caused.
For the admin this is much easier to grasp than services still being marked
as "fence" when the failed node has already been fenced, or is even already
up again.
Code-wise it makes sense too, to make the recovery part not so hidden
anymore, but show it for what it is: an actual part of the FSM.
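A self-contained sketch of the idea, with 'recovery' modelled as a first-class state; state and field names are illustrative only, not the actual ha-manager FSM code:

    use strict;
    use warnings;

    # Illustrative FSM fragment: a service enters 'recovery' once its node
    # was fenced, and leaves it again as soon as a new node was selected.
    my %next_state = (
        fence => sub {
            my ($sd) = @_;
            return $sd->{node_fenced} ? 'recovery' : 'fence';
        },
        recovery => sub {
            my ($sd) = @_;
            return defined($sd->{new_node}) ? 'started' : 'recovery';
        },
    );

    my $sd = { state => 'fence', node_fenced => 1, new_node => 'node2' };
    $sd->{state} = $next_state{ $sd->{state} }->($sd) for 1 .. 2;
    print "service is now in state '$sd->{state}'\n";    # started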
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
To see whether just a few or many tests are broken, it is sometimes useful
to run them all, and not just exit after the first failure.
Allow this as an opt-in feature.
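One possible shape for such an opt-in, using an environment variable and runner script names that are assumptions here, not the real test harness:

    use strict;
    use warnings;

    # Hypothetical opt-in: keep going after a failing test and print a
    # summary at the end, instead of dying on the first failure.
    my $keep_going = $ENV{RUN_ALL_TESTS} // 0;  # e.g. RUN_ALL_TESTS=1 ./run-tests.pl
    my @failed;

    for my $test (sort glob 'test-*') {
        next if system('./run-single-test.pl', $test) == 0;  # assumed runner
        die "test '$test' failed\n" if !$keep_going;
        push @failed, $test;
    }

    die "failed tests:\n" . join("\n", @failed) . "\n" if @failed;
    print "all tests passed\n";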
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
The service addition and deletion commands, and also the artificial delay
command (useful to force continuation of the HW), were missing completely.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Both `override_dh_systemd_enable` and `override_dh_systemd_start` are
ignored at the current compat level 12, and will become an error at
level >= 13, so drop them and use `override_dh_installsystemd` for
both of the previous uses.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
We do not need to pass a target storage, as the identity mapping already
prefers replicated storage for replicated disks, and other cases do not
make sense anyway as they wouldn't work for HA recovery.
We probably want to check the "really only replicated OK migrations"
constraint in the respective API code paths for the "ha" RPC environment
case, though.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Those differ from the "managed" service check in that they do not check
the state at all; they just check whether a SID is in the config or not,
or respectively delete it.
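A self-contained sketch of what such config-only helpers boil down to (hash layout and names are assumptions):

    use strict;
    use warnings;

    # Illustrative helpers working purely on the configuration hash; unlike
    # a "managed" check they never look at the current service state.
    sub service_exists {
        my ($conf, $sid) = @_;
        return exists $conf->{ids}->{$sid} ? 1 : 0;
    }

    sub service_delete {
        my ($conf, $sid) = @_;
        delete $conf->{ids}->{$sid};
    }

    my $conf = { ids => { 'vm:100' => { state => 'started' } } };
    print service_exists($conf, 'vm:100'), "\n";    # 1
    service_delete($conf, 'vm:100');
    print service_exists($conf, 'vm:100'), "\n";    # 0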
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
To avoid an early disconnect during shutdown, ensure we are ordered After
them; for shutdown the ordering is reversed, so we're stopped before
those two - this allows checking the node stats and doing SSH work
if something fails.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
We simply remember the node we were on if we got moved away for maintenance.
This record gets dropped once we move to _any_ other node, be it:
* our previous node, as it came back from maintenance
* another node, due to manual migration, group priority changes or
  fencing
The first point is handled explicitly by this patch. In select_service_node
we check for an old fallback node; if that node is found in the online node
list with top priority, we _always_ move back to it - even if there's no
other real reason for a move.
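Roughly, that check could look like the following self-contained sketch; the real select_service_node takes more inputs and does proper priority grouping, so names and structure here are illustrative only:

    use strict;
    use warnings;

    # Sketch: if the recorded pre-maintenance node is back among the
    # top-priority online nodes, always prefer it over the current node.
    sub pick_node {
        my ($online_top_prio, $current, $maintenance_fallback) = @_;

        my %online = map { $_ => 1 } @$online_top_prio;

        return $maintenance_fallback
            if defined($maintenance_fallback) && $online{$maintenance_fallback};

        return $current if $online{$current};
        return $online_top_prio->[0];
    }

    # service was moved from node1 to node2 for maintenance; node1 is back:
    print pick_node(['node1', 'node2'], 'node2', 'node1'), "\n";    # node1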
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
This adds handling for a new shutdown policy, namely "migrate".
If that is set, then the LRM doesn't queue stop jobs, but transitions
to a new mode, namely 'maintenance'.
The LRM modes now get passed on by the CRM in the NodeStatus update
method; this allows detecting such a mode and making node-status state
transitions. Effectively we only allow the transition if we're
currently online, else this is ignored. 'maintenance' does not
protect from fencing.
The moving then gets done by select_service_node. A node in
maintenance mode is not in "list_online_nodes" and so also not in
online_node_usage, which is used to re-calculate whether a service needs
to be moved. Only started services will get moved; this can be done almost
entirely by leveraging existing behavior, the next_state_started FSM state
transition method just needs to be taught not to return early for
nodes which are not online but in maintenance mode.
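Schematically (a simplified, self-contained sketch with assumed field names, not the actual next_state_started code):

    use strict;
    use warnings;

    # Simplified: a started service on a node in maintenance mode falls
    # through to node selection instead of hitting the early
    # "node not online" return path.
    sub next_state_started {
        my ($ns, $sd) = @_;    # node status, service data

        my $node = $sd->{node};

        if (!$ns->{$node}->{online}) {
            # not online and not in maintenance: keep waiting (e.g. for fencing)
            return if !$ns->{$node}->{maintenance};
            # in maintenance mode: fall through so a new node gets selected
        }

        $sd->{node} = $sd->{selected_node} // $node;
        return;
    }

    my $ns = { node1 => { online => 0, maintenance => 1 }, node2 => { online => 1 } };
    my $sd = { node => 'node1', selected_node => 'node2' };
    next_state_started($ns, $sd);
    print "service now on $sd->{node}\n";    # node2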
A few tests, adapted from the other policy tests, are added to
showcase the behavior with reboot, shutdown, and shutdown of the current
manager. They also show the behavior when a service cannot be
migrated, albeit as our test system is limited to simulating at most 9
migration failures, it "seems" to succeed after that. But note here
that the maximum number of retries would have been hit way earlier, so
this is just an artifact of our test system.
Besides some implementation details, two questions are still not solved
by this approach:
* What if a service cannot be moved away, either due to errors or because
  no alternative node is found by select_service_node?
  - Retrying indefinitely; this is what happens currently. The user set
    this up like this in the first place. We will order SSH and pveproxy
    after the LRM service to ensure that there's still the possibility
    for manual intervention.
  - An idea would be to track the time and see if we're stuck (this is
    not too hard); in such a case we could stop the services after X
    minutes and continue.
* A full cluster shutdown. But that is not too ideal even without this
  mode: nodes will already get fenced once no partition is quorate anymore.
  And as long as it's just a central setting in the DC config,
  an admin has a single switch to flip to make it work, so I'm not sure
  how much handling we want to do here; if we get past the point where
  we have no quorum we're dead anyhow. So this is at least not really an
  issue of this series - orthogonally related, yes, but not more.
For real-world usability the datacenter.cfg schema needs to be
changed to allow the migrate shutdown policy, but that's trivial.
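For illustration only, such a schema extension might look roughly like this; the property name, enum values and default are assumptions here, the real definition lives in the datacenter.cfg schema code:

    use strict;
    use warnings;

    # Sketch of extending an enum-style option to accept the new value.
    my $datacenter_schema = {
        shutdown_policy => {
            type => 'string',
            enum => [qw(freeze failover conditional migrate)],  # 'migrate' is new
            default => 'conditional',
            optional => 1,
            description => "Policy for HA-managed guests when a node shuts down.",
        },
    };

    print join(', ', @{ $datacenter_schema->{shutdown_policy}->{enum} }), "\n";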
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
As the service load is often still present on the source, and the
target may already feel the performance impact of an incoming migration,
account the service to both nodes during that time.
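A self-contained sketch of that accounting; field names and states are illustrative, not the actual online_node_usage code:

    use strict;
    use warnings;

    # Sketch: while a service is migrating or relocating, count it towards
    # both the source node and the target node.
    sub recompute_node_usage {
        my ($services) = @_;

        my $usage = {};
        for my $sd (values %$services) {
            $usage->{ $sd->{node} }++;
            if ($sd->{state} eq 'migrate' || $sd->{state} eq 'relocate') {
                $usage->{ $sd->{target} }++ if defined $sd->{target};
            }
        }
        return $usage;
    }

    my $usage = recompute_node_usage({
        'vm:100' => { node => 'node1', state => 'migrate', target => 'node2' },
        'vm:101' => { node => 'node2', state => 'started' },
    });
    print "node1=$usage->{node1} node2=$usage->{node2}\n";    # node1=1 node2=2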
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Remove further locks from a service after it was recovered from a
fenced node. This can be done because the node was fenced, and thus the
operation the service was locked for was interrupted anyway. We note in
the syslog that we removed a lock.
Mostly we just disallow removing the 'create' lock, as that is the only
case where we know that the service was not yet in a runnable state before.
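As a self-contained sketch of the rule (field names and the log message are illustrative):

    use strict;
    use warnings;

    # Sketch: a lock left over from an operation interrupted by fencing may
    # be dropped, except 'create', where the service may never have reached
    # a runnable state.
    sub drop_stale_lock {
        my ($service_conf, $sid) = @_;

        my $lock = $service_conf->{lock} // return undef;
        return undef if $lock eq 'create';    # never touch half-created services

        delete $service_conf->{lock};
        warn "removed leftover lock '$lock' from recovered service '$sid'\n";
        return $lock;
    }

    my $conf = { lock => 'migrate' };
    drop_stale_lock($conf, 'vm:100');    # logs and removes the 'migrate' lock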
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
It's always good to say that we request it, so that people do not think
the task should have already been started.
Also include the service ID (SID), so people know what we want(ed) to
stop in the first place.
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
This should reduce confusion between the old 'set <sid> --state stopped' and
the new 'stop' command by making it explicit that it is sent as a crm command.
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
Not every command parameter is 'target' anymore, so
it was necessary to modify the parsing of $sd->{cmd}.
Just changing the state to request_stop is not enough;
we need to actually update the service configuration as well.
Add a simple test for the stop command.
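The parsing change can be pictured roughly as follows; command names and fields are illustrative, not the actual manager code:

    use strict;
    use warnings;

    # Sketch: the queued crm command is no longer always "<cmd> <target>",
    # so dispatch on the command name before interpreting the parameter.
    sub handle_crm_cmd {
        my ($sd, $cmd_string) = @_;

        my ($cmd, $param) = split /\s+/, $cmd_string, 2;

        if ($cmd eq 'migrate' || $cmd eq 'relocate') {
            $sd->{target} = $param;          # parameter is a node name
        } elsif ($cmd eq 'stop') {
            $sd->{timeout} = $param // 0;    # parameter is a timeout in seconds
            $sd->{state} = 'request_stop';
            # also update the service configuration, not just the in-memory state
        } else {
            warn "ignoring unknown crm command '$cmd'\n";
        }
    }

    my $sd = { state => 'started' };
    handle_crm_cmd($sd, 'stop 60');
    print "$sd->{state} (timeout $sd->{timeout})\n";  # request_stop (timeout 60)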
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Introduces a timeout parameter for shutting a resource down.
If the parameter is 0, we perform a hard stop instead of a shutdown.
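A minimal sketch of that convention (the default value and the printed actions are assumptions):

    use strict;
    use warnings;

    # Sketch: a timeout of 0 requests a hard stop, anything else a graceful
    # shutdown bounded by that many seconds.
    sub stop_resource {
        my ($sid, $timeout) = @_;

        if (defined($timeout) && $timeout == 0) {
            print "hard stopping $sid\n";
            return;
        }
        $timeout //= 60;    # assumed default
        print "shutting down $sid (timeout ${timeout}s)\n";
    }

    stop_resource('vm:100', 0);      # hard stop
    stop_resource('vm:100', 180);    # graceful shutdown, 180s timeout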
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>