mirror of
https://github.com/samba-team/samba.git
synced 2025-01-11 05:18:09 +03:00
c315fce17e
Reviewed-by: Andrew Bartlett <abartlet@samba.org> Reviewed-by: Michael Adam <obnox@samba.org> Autobuild-User(master): Andrew Bartlett <abartlet@samba.org> Autobuild-Date(master): Fri Nov 6 13:43:45 CET 2015 on sn-devel-104
Tdb is a hashtable database with multiple concurrent writer and external
record lock support. For speed reasons, wherever possible tdb uses a shared
memory mapped area for data access. In its currently released form, it uses
fcntl byte-range locks to coordinate access to the data itself.

The tdb data is organized as a hashtable. Hash collisions are dealt with by
forming a linked list of records that share a hash value. The individual
linked lists are protected across processes with 1-byte fcntl locks on the
starting pointer of the linked list representing a hash value.

The external locking API of tdb allows one to lock individual records. Instead
of really locking individual records, the tdb API locks a complete linked list
with a fcntl lock.

The external locking API of tdb also allows one to lock the complete database,
and ctdb uses this facility to freeze databases during a recovery. While the
so-called allrecord lock is held, all linked lists and all individual records
are frozen altogether. Tdb achieves this by locking the complete file range
with a single fcntl lock. Individual 1-byte locks for the linked lists
conflict with this. Access to records is prevented by the one large fcntl byte
range lock.

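As an illustration of the two lock shapes described above, here is a minimal
sketch in C of a 1-byte fcntl lock and a whole-range fcntl lock. It is not
taken from the tdb sources: the helper names range_lock, byte_is_blocked and
blocked_for_other_process are made up for this example, and the offsets are
arbitrary rather than tdb's real file layout. It relies on the POSIX rule that
an l_len of 0 extends a lock to the end of the file, whatever its size.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Take (F_RDLCK/F_WRLCK) or drop (F_UNLCK) a blocking byte-range lock.
 * len == 0 means "from offset to the end of the file". */
int range_lock(int fd, short type, off_t offset, off_t len)
{
	struct flock fl = {
		.l_type = type,
		.l_whence = SEEK_SET,
		.l_start = offset,
		.l_len = len,
	};
	return fcntl(fd, F_SETLKW, &fl);
}

/* Ask whether a write lock on one byte would conflict with a lock held by
 * another process. Returns 1 if blocked, 0 if free, -1 on error. */
int byte_is_blocked(int fd, off_t offset)
{
	struct flock fl = {
		.l_type = F_WRLCK,
		.l_whence = SEEK_SET,
		.l_start = offset,
		.l_len = 1,
	};
	if (fcntl(fd, F_GETLK, &fl) == -1) {
		return -1;
	}
	return fl.l_type != F_UNLCK;
}

/* fcntl locks never conflict within one process, so fork a child and let it
 * report whether the byte at offset is blocked from its point of view. */
int blocked_for_other_process(int fd, off_t offset)
{
	int status;
	pid_t pid = fork();

	if (pid == -1) {
		return -1;
	}
	if (pid == 0) {
		_exit(byte_is_blocked(fd, offset) == 1 ? 1 : 0);
	}
	if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status)) {
		return -1;
	}
	return WEXITSTATUS(status);
}
```

With a scratch file open as fd, range_lock(fd, F_WRLCK, 100, 1) blocks other
processes on byte 100 only, while range_lock(fd, F_WRLCK, 0, 0) blocks them on
every byte. That second form is the conflict the allrecord lock relies on.
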
Fcntl locks have been chosen for tdb for two reasons: First, they are portable
across all current unixes. Secondly, they provide auto-cleanup. If a process
dies while holding a fcntl lock, the lock is given up as if it was explicitly
unlocked. Thus fcntl locks provide a very robust locking scheme: if a process
dies for any reason, the database will not stay blocked until reboot. This
robustness is very important for long-running services; a reboot is not an
option for most users of tdb.

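The auto-cleanup property can be observed with a small experiment. The sketch
below is an illustration only, not tdb code: the helper names try_lock_byte
and relock_after_owner_death are made up, and killing the child with SIGKILL
is just one way to simulate a process that dies without unlocking.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Try to write-lock one byte without blocking.
 * Returns 0 on success, -1 if the lock is busy or on error. */
int try_lock_byte(int fd, off_t offset)
{
	struct flock fl = {
		.l_type = F_WRLCK,
		.l_whence = SEEK_SET,
		.l_start = offset,
		.l_len = 1,
	};
	return fcntl(fd, F_SETLK, &fl);
}

/* A child locks the byte and is killed while still holding the lock.
 * Returns 0 if the parent can take the same lock afterwards. */
int relock_after_owner_death(int fd, off_t offset)
{
	int status;
	pid_t pid = fork();

	if (pid == -1) {
		return -1;
	}
	if (pid == 0) {
		if (try_lock_byte(fd, offset) != 0) {
			_exit(2);
		}
		raise(SIGKILL);  /* die hard, no chance to unlock */
		_exit(3);        /* not reached */
	}
	if (waitpid(pid, &status, 0) == -1 || !WIFSIGNALED(status)) {
		return -1;
	}
	/* The kernel dropped the dead child's lock, so this succeeds. */
	return try_lock_byte(fd, offset);
}
```

The parent needs no recovery step of its own here. The same convenience also
means it cannot tell whether the previous holder unlocked cleanly or died.
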
Unfortunately, during stress testing, fcntl locks have turned out to be a major
problem for performance. The particular problem that was seen happens when
ctdb on a busy server does a recovery. A recovery means that ctdb has to
freeze all tdb databases for some time, usually a few seconds. This is done
with the allrecord lock. During the recovery phase on a busy server many smbd
processes try to access the tdb file with blocking fcntl calls. The specific
test in question easily reproduces 7,000 processes piling up waiting for
1-byte fcntl locks. When ctdb is done with the recovery, it gives up the
allrecord lock covering the whole file range. All 7,000 processes waiting for
1-byte fcntl locks are woken up, trying to acquire their lock. The special
implementation of fcntl locks in Linux (up to 2013-02-12 at least) protects
all fcntl lock operations with a single system-wide spinlock. If 7,000
processes are waiting for the allrecord lock to be released, this leads to a
thundering herd condition: all CPUs are spinning on that single spinlock.

Functionally the kernel is fine: eventually the thundering herd slows down and
every process correctly gets its share and locking range, but the performance
of the system while the herd is active is worse than expected.

The thundering herd is only the worst-case scenario for fcntl lock use. The
single spinlock for fcntl operations is also a performance penalty for normal
operations. In the cluster case, every read and write SMB request has to do
two fcntl calls to provide correct SMB mandatory locks. The single spinlock
is one source of serialization for the SMB read/write requests, limiting the
parallelism that can be achieved in a multi-core system.

While trying to tune his servers, Ira Cooper, Samba Team member, found fcntl
locks to be a problem on Solaris as well. Ira pointed out that there is a
potential alternative locking mechanism that might be more scalable:
process-shared robust mutexes, as defined by POSIX 2008, for example via

http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutexattr_setpshared.html
http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutexattr_setrobust.html

Pthread mutexes provide one of the core mechanisms in POSIX threads to protect
in-process data structures from concurrent access by multiple threads. In the
Linux implementation, a pthread_mutex_t is represented by a data structure in
user space that requires no kernel calls in the uncontended case for locking
and unlocking. Locking and unlocking in the uncontended case are implemented
purely in user space with atomic CPU instructions and thus are very fast.

The setpshared functions indicate to the kernel that the mutex is about to be
shared between processes in a common shared memory area.

The process-shared POSIX mutexes have the potential to replace fcntl locking
to coordinate mmap access for tdbs. However, they are missing the critical
auto-cleanup property that fcntl provides when a process dies. A process that
dies hard while holding a shared mutex has no chance to clean up the protected
data structures and unlock the shared mutex. Thus with a pure process-shared
mutex the mutex will remain locked forever until the data structures are
re-initialized from scratch.

With the robust mutexes defined by POSIX, the process-shared mutexes have been
extended with a limited auto-cleanup property. If a mutex has been declared
robust, when a process exits while holding that mutex, the next process trying
to lock the mutex will get the special error code EOWNERDEAD. This informs
the caller that the data structures the mutex protects are potentially corrupt
and need to be cleaned up.

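To make the EOWNERDEAD behaviour concrete, here is a minimal sketch in C of a
robust, process-shared mutex placed in anonymous shared memory. It is an
illustration under assumptions, not the tdb implementation: the helper names
shared_robust_mutex_create, robust_lock and die_holding are invented, tdb's
real mutexes live in the mapped database file rather than in an anonymous
mapping, and the "repair" step is left as a comment. The pthread calls used
are the POSIX 2008 ones, including pthread_mutex_consistent, which marks the
mutex usable again after the caller has cleaned up.

```c
#define _DEFAULT_SOURCE  /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create one robust, process-shared mutex in memory that stays shared
 * across fork(). Returns NULL on failure. */
pthread_mutex_t *shared_robust_mutex_create(void)
{
	pthread_mutexattr_t attr;
	pthread_mutex_t *m;
	int ret;

	m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
		 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (m == MAP_FAILED) {
		return NULL;
	}
	ret = pthread_mutexattr_init(&attr);
	if (ret != 0) {
		munmap(m, sizeof(*m));
		return NULL;
	}
	ret = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
	if (ret == 0) {
		ret = pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
	}
	if (ret == 0) {
		ret = pthread_mutex_init(m, &attr);
	}
	pthread_mutexattr_destroy(&attr);
	if (ret != 0) {
		munmap(m, sizeof(*m));
		return NULL;
	}
	return m;
}

/* Lock the mutex. If the previous owner died while holding it, report that
 * through *owner_died and mark the mutex consistent again. */
int robust_lock(pthread_mutex_t *m, int *owner_died)
{
	int ret = pthread_mutex_lock(m);

	*owner_died = 0;
	if (ret == EOWNERDEAD) {
		*owner_died = 1;
		/* A real user would repair the protected data here. */
		ret = pthread_mutex_consistent(m);
	}
	return ret;
}

/* Fork a child that locks the mutex and exits without unlocking it. */
int die_holding(pthread_mutex_t *m)
{
	int status;
	pid_t pid = fork();

	if (pid == -1) {
		return -1;
	}
	if (pid == 0) {
		_exit(pthread_mutex_lock(m) == 0 ? 0 : 1);
	}
	if (waitpid(pid, &status, 0) == -1) {
		return -1;
	}
	return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

After die_holding(m), the next robust_lock(m, &owner_died) call still acquires
the mutex, but sets owner_died to 1. Without the robust attribute the same
call would block forever.
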
The error code EOWNERDEAD when trying to lock a mutex is an extension over
the fcntl functionality. A process that does a blocking fcntl lock call is not
informed about whether the lock was explicitly freed by a process still alive
or due to an unplanned process exit. At the time of this writing (February
2013), at least Linux and OpenSolaris also implement the robustness feature of
process-shared mutexes.

Converting the tdb locking mechanism from fcntl to mutexes has to take care of
both types of locks that are used on tdb files.

The easy part is to use mutexes to replace the 1-byte linked list locks
covering the individual hashes. Those can be represented by a mutex each.

Covering the allrecord lock is more difficult. The allrecord lock uses a fcntl
lock spanning all hash list locks simultaneously. This basic functionality is
not easily possible with mutexes. A mutex carries 1 bit of information; a
fcntl lock can carry an arbitrary amount of information.

In order to support the allrecord lock, we have an allrecord_lock variable
protected by an allrecord_mutex. The coordination between the allrecord lock
and the chainlocks works like this:

- Getting a chain lock works like this:

  1. get chain mutex
  2. return success if allrecord_lock is F_UNLCK (not locked)
  3. return success if allrecord_lock is F_RDLCK (locked readonly)
     and we only need a read lock.
  4. release chain mutex
  5. wait for allrecord_mutex
  6. unlock allrecord_mutex
  7. goto 1.

- Getting the allrecord lock:

  1. get the allrecord mutex
  2. return error if allrecord_lock is not F_UNLCK (it's locked)
  3. set allrecord_lock to the desired value.
  4. in a loop: lock(blocking) / unlock each chain mutex.
  5. return success.

- allrecord lock upgrade:

  1. check we already have the allrecord lock with F_RDLCK.
  3. set allrecord_lock to F_WRLCK
  4. in a loop: lock(blocking) / unlock each chain mutex.
  5. return success.

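The three procedures above can be sketched in C roughly as follows. This is a
simplified illustration, not the tdb code. Several things are assumptions made
for the example: the struct lock_area layout, all function names, the fixed
NUM_CHAINS, the use of plain in-process mutexes instead of robust
process-shared ones, and the use of a C11 atomic for allrecord_lock so that
reading it under a chain mutex is well defined. The release function at the
end is not described in the text at all and is only a guess at the obvious
counterpart. Step numbers in the comments refer to the lists above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>     /* F_UNLCK, F_RDLCK, F_WRLCK */
#include <pthread.h>
#include <stdatomic.h>

#define NUM_CHAINS 4   /* hypothetical; one mutex per hash chain */

struct lock_area {
	pthread_mutex_t allrecord_mutex;
	atomic_int allrecord_lock;  /* F_UNLCK, F_RDLCK or F_WRLCK */
	pthread_mutex_t chain[NUM_CHAINS];
};

/* Error checks omitted for brevity. */
void lock_area_init(struct lock_area *a)
{
	pthread_mutex_init(&a->allrecord_mutex, NULL);
	atomic_init(&a->allrecord_lock, F_UNLCK);
	for (int i = 0; i < NUM_CHAINS; i++) {
		pthread_mutex_init(&a->chain[i], NULL);
	}
}

/* "Getting a chain lock"; rw is F_RDLCK or F_WRLCK. */
int get_chain_lock(struct lock_area *a, int idx, int rw)
{
	for (;;) {
		int state;

		if (pthread_mutex_lock(&a->chain[idx]) != 0) {      /* step 1 */
			return -1;
		}
		state = atomic_load(&a->allrecord_lock);
		if (state == F_UNLCK) {                             /* step 2 */
			return 0;
		}
		if (state == F_RDLCK && rw == F_RDLCK) {            /* step 3 */
			return 0;
		}
		pthread_mutex_unlock(&a->chain[idx]);               /* step 4 */
		if (pthread_mutex_lock(&a->allrecord_mutex) != 0) { /* step 5 */
			return -1;
		}
		pthread_mutex_unlock(&a->allrecord_mutex);          /* step 6 */
	}                                                       /* step 7 */
}

void drop_chain_lock(struct lock_area *a, int idx)
{
	pthread_mutex_unlock(&a->chain[idx]);
}

/* Lock and unlock every chain mutex once, which waits until each current
 * chain holder has released its chain. */
int sweep_chains(struct lock_area *a)
{
	for (int i = 0; i < NUM_CHAINS; i++) {
		if (pthread_mutex_lock(&a->chain[i]) != 0) {
			return -1;
		}
		pthread_mutex_unlock(&a->chain[i]);
	}
	return 0;
}

/* "Getting the allrecord lock"; the mutex stays held on success. */
int get_allrecord_lock(struct lock_area *a, int rw)
{
	if (pthread_mutex_lock(&a->allrecord_mutex) != 0) {     /* step 1 */
		return -1;
	}
	if (atomic_load(&a->allrecord_lock) != F_UNLCK) {       /* step 2 */
		pthread_mutex_unlock(&a->allrecord_mutex);
		return -1;
	}
	atomic_store(&a->allrecord_lock, rw);                   /* step 3 */
	return sweep_chains(a);                                 /* steps 4, 5 */
}

/* "allrecord lock upgrade"; ownership is not tracked in this sketch. */
int upgrade_allrecord_lock(struct lock_area *a)
{
	if (atomic_load(&a->allrecord_lock) != F_RDLCK) {       /* step 1 */
		return -1;
	}
	atomic_store(&a->allrecord_lock, F_WRLCK);              /* step 3 */
	return sweep_chains(a);                                 /* steps 4, 5 */
}

/* Assumption: the release path is not described in the text above. */
void drop_allrecord_lock(struct lock_area *a)
{
	atomic_store(&a->allrecord_lock, F_UNLCK);
	pthread_mutex_unlock(&a->allrecord_mutex);
}
```

The point of keeping allrecord_mutex held for the lifetime of the allrecord
lock is that chain lockers who cannot proceed park on that one mutex in step 5
instead of polling. The sweep over all chain mutexes is what makes the
allrecord locker wait for chain holders that got in before allrecord_lock
changed.
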