samba-mirror/ctdb/config/events.d
Ronnie Sahlberg fa872de664 60.nfs:
we must always restart the lockmanager when the cluster has been
reconfigured and ip addresses have changed. This is to make sure we get a
clusterwide grace period for nfs locking.
If we don't do this and only restart locking on the nodes that were
directly affected, a different client can take out a conflicting lock
from a different node before the affected clients have had a chance to
reclaim all the locks lost during the reconfigure.
The grace period on the rhel5 kernel has been increased to 90 seconds!

statd-callout:
we must restart the lockmanager to ensure a clusterwide grace period for
nfs. This makes locking "more correct" for nfs clients and prevents
other clients/nodes from taking out a conflicting lock while a different
client/node tries to reclaim lost locks.
This makes it "almost consistent" for NFS clients, but there is still
the possibility that a cifs client can take out a conflicting lock
before an nfs client has had a chance to reclaim an existing lock.
This can not be solved with anything less than making the kernel nfs 
lock manager "samba aware" and making samba aware of the internal state 
of the kernel lock manager so that they can cooperate.

we can not just stop/start the lockmanager back to back in rhel5, since
if they are stopped/started too close to each other, then when the new
lockmanager sends out statd notifications upon starting up, two things
can happen:
1. the new lockmanager sends out the notification BEFORE it has
   registered with the portmapper, leading to:
     lockmanager starts
     lockmanager sends notification to the client
     client tries to recover the lock and tries to portmap the
     lockmanager port on the server.
     server is not (yet) registered with the portmapper and responds
     "no such program" to the client's request to discover where the
     lockmanager is.
     client then just completely gives up reclaiming the lock and
     doesn't even reattempt the portmapper call after some timeout.
     ==> lock reclaim failed.
2. if they are started back to back and a client tries to reclaim the
   lock, the lockmanager sometimes sends two responses back to back
   to the client: one with status NLM_GRANTED (== you got the lock
   reclaimed) and one with status NLM_DENIED (== you could not get the
   lock reclaimed).
   This confuses the client and leads to the server thinking that the
   client does have the lock and the client thinking it does not have
   the lock, and orphaned locks result.
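
A minimal sketch of the kind of restart this implies on rhel5 (the
service name and the delay are illustrative, not the exact statd-callout
logic):

    # stop the lockmanager, wait so the stop and the start are not
    # back to back, then start it again so a new clusterwide grace
    # period begins and clients can reclaim their locks
    service nfslock stop
    sleep 2
    service nfslock start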


We also send out additional notification messages of different formats
to allow more legacy clients to interoperate with locking.

(This used to be ctdb commit 13208c1aab2942e28dff87e38e6794bf0c026033)
2007-09-07 08:52:56 +10:00
00.ctdb change the now rather small /etc/ctdb/events script into a service 2007-08-15 15:01:31 +10:00
10.interface change how we do public addresses and takeover so that we can have 2007-09-04 09:50:07 +10:00
40.vsftpd add a simple events script to manage vsftpd 2007-06-05 18:14:01 +10:00
50.samba start winbind before smbd 2007-08-16 11:34:35 +10:00
60.nfs 60.nfs: 2007-09-07 08:52:56 +10:00
61.nfstickle we dont use 'sendip' any more so dont check for it and exit from the 2007-09-05 15:39:51 +10:00
README fix typo 2007-08-15 11:38:27 +10:00

This directory is where you should put any local or application
specific event scripts for ctdb to call.

All event scripts start with the prefix 'NN.' where N is a digit.
The event scripts are run in sequence based on NN.
Thus 10.interfaces will be run before 60.nfs.

Each NN must be unique and duplicates will cause undefined behaviour.
I.e. having both 10.interfaces and 10.otherstuff is not allowed.


As a special case, any eventscript that ends with a '~' character will be
ignored, since this is a common suffix that some editors append to older
versions of a file.


The eventscripts are called with a varying number of arguments.
The first argument is the "event" and the rest of the arguments depend
on which event was triggered.
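
For illustration, a hypothetical eventscript skeleton (not one of the
scripts listed above) might dispatch on the event name like this:

	#!/bin/sh
	# hypothetical /etc/ctdb/events.d/99.example
	# the first argument is the event, the rest depend on the event
	event="$1"
	shift

	case "$event" in
	    startup)
		# start the service and wait for it to become available
		;;
	    shutdown)
		# perform a controlled shutdown of the service
		;;
	    monitor)
		# exit nonzero if the service is unhealthy
		;;
	    takeip|releaseip)
		# $1=interface $2=ipaddress $3=netmask
		;;
	    recovered)
		# reconfigure/restart the service if needed
		;;
	esac

	exit 0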

The events currently implemented are
startup
	This event does not take any additional arguments.
	This event is only invoked once, when ctdb is starting up.
	This event is used to wait for the service to start and for all
	resources the service needs to become available.

	This is used to prevent ctdb from starting up and advertising its
	services until all dependent services have become available.

	All services that are managed by ctdb should implement this
	event and use it to start the service.

	Example: 50.samba uses this event to start the samba daemon
	and then wait until samba and all its associated services have
	become available. It then also proceeds to wait until all
	shares have become available.
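
	A minimal sketch of a startup handler, assuming a hypothetical
	'myservice' init script:

	    startup)
		# start the service ...
		service myservice start
		# ... then wait until it actually answers before returning
		while ! service myservice status >/dev/null 2>&1 ; do
			sleep 1
		done
		;;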

shutdown
	This event is called when the ctdb service is shutting down.

	All services that are managed by ctdb should implement this event
	and use it to perform a controlled shutdown of the service.

	Example: 60.nfs uses this event to shut down nfs and all associated
	services and stop exporting any shares when this event is invoked.
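
	A matching sketch of a shutdown handler (same hypothetical service):

	    shutdown)
		# perform a controlled shutdown of the service
		service myservice stop
		;;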

monitor
	This event is invoked every X seconds.
	The interval can be configured using the MonitorInterval tunable
	but defaults to 15 seconds.

	This event is triggered by ctdb to continuously monitor that all
	managed services are healthy.
	When invoked, the event script will check that the service is healthy
	and return 0 if so. If the service is not healthy the event script
	should return nonzero.

	If a service returns nonzero from this script, ctdb will consider the
	node UNHEALTHY and will fail the public address and all associated
	services over to a different node in the cluster.

	All managed services should implement this event.

	Example: 10.interfaces, which checks that the public interface (if used)
	is healthy, i.e. that it has a physical link established.
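
	A sketch of a monitor handler that probes a hypothetical tcp port
	(12345 is purely illustrative) and reports health via the exit code:

	    monitor)
		# exit 0 when healthy, nonzero to mark the node UNHEALTHY
		if netstat -ltn | grep -q ':12345 ' ; then
			exit 0
		fi
		echo "myservice is not listening on port 12345"
		exit 1
		;;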

takeip
	This event is triggered every time the node takes over a public ip
	address during recovery.
	This event takes three additional arguments:
	'interface', 'ipaddress' and 'netmask'

	This event will always be followed by a 'recovered' event once
	all ipaddresses have been reassigned to new nodes and the ctdb database
	has been recovered.
	If multiple ip addresses are reassigned during recovery it is
	possible to get several 'takeip' events followed by a single 
	'recovered' event.

	Since taking over an ip address might involve substantial work for the
	service, and since multiple ip addresses might be taken over in a
	single recovery, it is often best to only record which addresses are
	being taken over in this event and defer the actual work of
	reconfiguring or restarting the services until the 'recovered' event.

	Example: 60.nfs, which just records which ip addresses are being taken
	over in a local state directory and defers the actual restart of the
	services until the 'recovered' event.
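
	A sketch of such a takeip handler, using a hypothetical state
	directory (the path is illustrative, not the one 60.nfs uses):

	    takeip)
		iface="$1" ip="$2" netmask="$3"
		mkdir -p /var/ctdb/state/myservice
		# only record the address here; the real work is deferred
		# to the 'recovered' event
		touch /var/ctdb/state/myservice/takeip.$ip
		;;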


releaseip
	This event is triggered every time the node releases a public ip
	address during recovery.
	This event takes three additional arguments:
	'interface', 'ipaddress' and 'netmask'

	In all other regards this event is analogous to the 'takeip' event above.

	Example: 60.nfs
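
	A matching releaseip sketch (same hypothetical state directory):

	    releaseip)
		iface="$1" ip="$2" netmask="$3"
		mkdir -p /var/ctdb/state/myservice
		touch /var/ctdb/state/myservice/releaseip.$ip
		;;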

recovered
	This event is triggered every time a full ctdb recovery has completed
	and all public ip addresses have been reassigned among the nodes.

	Example: 60.nfs, which, if the ip address configuration has changed
	during the recovery (i.e. if addresses have been taken over or
	released), will kill off any tcp connections that exist for that
	service and also send out statd notifications to all registered
	clients.
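
	A sketch of a recovered handler that restarts the service only if any
	takeip/releaseip markers were recorded (hypothetical paths and service
	as in the sketches above):

	    recovered)
		statedir=/var/ctdb/state/myservice
		if ls $statedir/takeip.* $statedir/releaseip.* >/dev/null 2>&1 ; then
			# the ip configuration changed during this recovery
			service myservice restart
			rm -f $statedir/takeip.* $statedir/releaseip.*
		fi
		;;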
	

Additional note for takeip, releaseip, recovered:

ALL services that depend on the ip address configuration of the node must 
implement all three of these events.

ALL services that use TCP should also implement these events and at least
kill off any tcp connections to the service if the ip address configuration
has changed, in a similar fashion to how 60.nfs does it.
The reason one must do this is that ESTABLISHED tcp connections may survive
from when an ip address is released and removed from the host until the ip
address is taken over again.
Any tcp connection that survives a release/takeip sequence can end up with
the client and server out of sync on sequence and ack numbers, which can
cause a disruptive ack storm.
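
A sketch of what killing such connections might look like for a released
address, using netstat to list them and the ctdb killtcp command to reset
them (port 2049 and the killtcp argument order are assumptions here):

	# $ip is the public address that was released or taken over
	# list established connections to the service port on that address
	netstat -tn | awk -v ip="$ip" '$6 == "ESTABLISHED" && $4 == ip":2049" {print $5, $4}' |
	while read src dst ; do
		# ask ctdb to reset this connection (argument order assumed: src dst)
		ctdb killtcp $src $dst
	done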