This patch only removes lines of code. For personal gratification, here is a
detailed explanation of what the problem was.
When glusterd spawns the local brick process, say when a reboot of the node
occurs, glusterd_brick_start() and subsequently glusterd_volume_start_glusterfs()
get called twice: from glusterd_spawn_daemons() and
glusterd_do_volume_quorum_action() respectively.
This causes a race, best described by a pseudo-code of the current behaviour:
glusterd_volume_start_glusterfs()
{
    if (!brick process running) {
        step-a) reap the pid file (i.e. unlink it)
        step-b) fork a brick process which creates and locks the pid file
                and binds the process to a socket.
    }
}
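To make that pattern concrete, here is a minimal, self-contained C sketch of
the same sequence. It is illustrative only, not the actual
glusterd_volume_start_glusterfs(); the "brick process running" check is elided
and /tmp/brick-demo.pid is a stand-in path.

/* Illustrative sketch of the racy sequence above; NOT the real
 * glusterd_volume_start_glusterfs(). */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static const char *pidfile_path = "/tmp/brick-demo.pid";

static void start_brick(void)
{
        /* step-a: reap the pid file (unlink it) */
        unlink(pidfile_path);

        /* step-b: fork a "brick"; the child creates and locks the pid file */
        if (fork() == 0) {
                int fd = open(pidfile_path, O_RDWR | O_CREAT, 0644);
                if (fd < 0 || lockf(fd, F_TLOCK, 0) < 0)
                        _exit(1);               /* could not take the lock */
                dprintf(fd, "%d\n", (int)getpid());
                /* ... bind the listening socket and serve ... */
                sleep(5);
                _exit(0);
        }
}

int main(void)
{
        start_brick();    /* Call-1 */
        start_brick();    /* Call-2, arriving almost immediately after */
        sleep(1);
        return 0;
}

With two near-simultaneous callers, the following interleaving can occur: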
Time  Event
----  -----
T1    Call-1 arrives, completes step-a, starts step-b.
T2    Call-2 arrives and enters step-a, as Call-1's forked child is not
      yet running.
T3    Call-1's forked child is alive, creates the pid file and locks it,
      and binds its address to a socket.
T4    Call-2 performs step-a, i.e. unlinks the pid file created by Call-1 !!
      (A file can still be unlinked despite a lockf on it; see the sketch
      after this timeline.)
T5    Call-2 does step-b, and the forked child process creates a *new* pid
      file with its pid and locks this file.
T6    But Call-2's brick process is not able to bind to the socket, as it is
      already in use (courtesy T3), and hence exits (so no locks are held on
      the pid file anymore).
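The parenthetical at T4 is worth spelling out: POSIX record locks
(lockf()/fcntl()) protect the file's byte ranges from other lockers, not its
directory entry, so unlink() succeeds regardless. A minimal sketch, using an
illustrative /tmp/lock-demo.pid path; both steps happen in one process here
for brevity, but a second process unlinking the file behaves the same way.

/* Sketch: an exclusive lockf() lock does not prevent unlink(). */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        const char *path = "/tmp/lock-demo.pid";

        int fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd < 0 || lockf(fd, F_LOCK, 0) < 0)
                return 1;                       /* could not open or lock */

        /* The lock is held, yet the name can still be removed ... */
        if (unlink(path) == 0)
                printf("unlink succeeded while the lock was held\n");

        /* ... after which the still-open fd shows up as "(deleted)" under
         * /proc/<pid>/fd, matching the observation after T4. */
        return 0;
}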
Result:
- The pid file now contains the PID of an extinct brick process.
- `gluster volume status` shows this PID value. It also notices that no lock
  is held on the pid file by the currently running brick process (created by
  Call-1) and hence shows N/A for the online status (a sketch of that check
  follows below).
Also, as a result of the events at T4, "ls -l /proc/<brick process PID>/fd"
shows the pid file entry as deleted.
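For context, the online check described above boils down to testing whether
any process still holds a lock on the pid file. A hedged sketch of that idea
follows; is_brick_running() and the path are illustrative stand-ins, not the
actual glusterd status code.

/* Sketch of a lock-based liveness test: if no process holds a lock on the
 * pid file, the brick recorded in it is presumed dead (online status N/A). */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int is_brick_running(const char *pidfile)
{
        int fd = open(pidfile, O_RDONLY);
        if (fd < 0)
                return 0;                    /* no pid file: not running */

        /* F_TEST fails (EAGAIN/EACCES) when another process holds a lock,
         * i.e. a live brick owns the pid file. */
        int running = (lockf(fd, F_TEST, 0) != 0);
        close(fd);
        return running;
}

int main(void)
{
        const char *pidfile = "/tmp/brick-demo.pid";   /* stand-in path */
        printf("online: %s\n", is_brick_running(pidfile) ? "yes" : "N/A");
        return 0;
}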
Fix:
1. Do not unlink the pid file, i.e. avoid step-a. Now at T5, Call-2's brick
   process cannot obtain the lock on the pid file (Call-1's process still
   holds it) and exits.
2. Unrelated, but also remove the lock-unlock sequence in
   glusterfs_pidfile_setup(), which does not seem to be doing anything useful.
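A sketch of what fix (1) amounts to, reusing the shape of the first sketch
above (the actual patch simply deletes the unlink and the lock-unlock
sequence rather than adding code):

/* Fixed sequence sketch: never unlink the pid file up front. A racing second
 * caller's child then fails the non-blocking lock (Call-1's process still
 * holds it) and exits, leaving the running brick untouched.
 * Illustrative only; not the literal patch. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static const char *pidfile_path = "/tmp/brick-demo.pid";

static void start_brick(void)
{
        /* step-a (the unlink) is gone */
        if (fork() == 0) {
                int fd = open(pidfile_path, O_RDWR | O_CREAT, 0644);
                if (fd < 0 || lockf(fd, F_TLOCK, 0) < 0)
                        _exit(1);            /* lock held by a live brick */
                if (ftruncate(fd, 0) == 0)
                        dprintf(fd, "%d\n", (int)getpid());
                /* ... bind the listening socket and serve ... */
                sleep(5);
                _exit(0);
        }
}

int main(void)
{
        start_brick();    /* Call-1: wins the lock and runs */
        start_brick();    /* Call-2: its child exits at the lock step */
        sleep(1);
        return 0;
}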
Change-Id: I18d3e577bb56f68d85d3e0d0bfd268e50ed4f038
BUG: 1035586
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
Reviewed-on: http://review.gluster.org/6786
Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>