doc: document bit-rot feature

Change-Id: Ibad640d01975906b7642c76a1649e3e272f3a8bc
BUG: 1170075
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Reviewed-on: http://review.gluster.org/9712
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
Venky Shankar 2015-02-18 17:01:21 +05:30 committed by Vijay Bellur
parent 866c64ba5e
commit cd3d34289c
4 changed files with 297 additions and 0 deletions


@ -0,0 +1,8 @@
00-INDEX
- this file
bitrot-docs.txt
- links to design, spec and feature page
object-versioning.txt
- object versioning mechanism to track object signature
memory-usage.txt
- memory usage during object expiry tracking


@ -0,0 +1,5 @@
* Feature page: http://www.gluster.org/community/documentation/index.php/Features/BitRot
* Design: http://goo.gl/Mjy4mD
* CLI specification: http://goo.gl/2o12Fn


@ -0,0 +1,48 @@
Object expiry tracking memory usage
===================================
The bitrot daemon tracks objects for expiry in a data structure known
as a "timer-wheel"; an object is signed once its timer expires. The
timer wheel is a well-known data structure for tracking the expiry of
millions of objects. Let's look at the memory usage involved in
tracking 1 million objects (per brick).
The bitrot daemon uses the "br_object" structure to hold the
information needed for signing. An instance of this structure is
allocated for each object that needs to be signed.
struct br_object {
        xlator_t *this;
        br_child_t *child;
        void *data;
        uuid_t gfid;
        unsigned long signedversion;
        struct list_head list;
};
Timer-wheel requires an instance of the structure below per
object that needs to be tracked for expiry.
struct gf_tw_timer_list {
        void *data;
        unsigned long expires;
        /** callback routine */
        void (*function)(struct gf_tw_timer_list *, void *, unsigned long);
        struct list_head entry;
};
Structure sizes (on a typical 64-bit system):
sizeof (struct br_object): 64 bytes
sizeof (struct gf_tw_timer_list): 40 bytes
Together, these structures take up 104 bytes per object. To track all 1 million
objects at the same time, the amount of memory taken up would be:
1,000,000 * 104 bytes = 104,000,000 bytes (~100MB)
Not so bad, I think.
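The arithmetic above can be checked with a small, self-contained sketch.
The two structures are copied from this document; everything else is an
assumption made so that the snippet compiles on its own: xlator_t and
br_child_t are left as opaque types, uuid_t is written out as a 16-byte
array, list_head is a plain two-pointer node, and tracking_bytes() is a
hypothetical helper. The sizes quoted above hold on an LP64 platform
(8-byte pointers and 8-byte unsigned long).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Opaque stand-ins (assumptions): only pointers to these are stored. */
typedef struct xlator xlator_t;
typedef struct br_child br_child_t;

/* Doubly linked list node with two pointers (assumed layout). */
struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

struct br_object {
    xlator_t *this;
    br_child_t *child;
    void *data;
    unsigned char gfid[16];      /* uuid_t is a 16-byte array */
    unsigned long signedversion;
    struct list_head list;
};

struct gf_tw_timer_list {
    void *data;
    unsigned long expires;
    /* callback routine */
    void (*function)(struct gf_tw_timer_list *, void *, unsigned long);
    struct list_head entry;
};

/* Hypothetical helper: bytes needed to track nr_objects at the same time. */
static size_t tracking_bytes(size_t nr_objects)
{
    size_t per_object = sizeof(struct br_object) +
                        sizeof(struct gf_tw_timer_list);

    assert(nr_objects <= SIZE_MAX / per_object);   /* guard against overflow */
    return nr_objects * per_object;
}
```

Calling tracking_bytes(1000000) on such a platform gives 104,000,000
bytes, which is roughly 99 MiB.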


@ -0,0 +1,236 @@
Object versioning
=================
Bitrot detection in GlusterFS relies on object (file) checksum (hash) verification,
also known as "object signature". An object is signed when there are no active
file descriptors referring to its inode (i.e., upon last close()). This is just a
hint for initiating hash calculation (and therefore signing). There is
absolutely no control over when clients can initiate modification operations on
the object. An object could be under modification while its hash computation is
in progress. It would also be inappropriate to restrict access to such objects
while they are being signed.
Object versioning is used as a mechanism to identify the staleness of an object's
signature. This document does not just list the version update protocol,
but also goes through the various factors that led to its design.
NOTE: The word "object" is used to represent a "regular file" (in the Linux sense), and
object versions are persisted in extended attributes of the object's inode.
Signature calculation covers the object's data only (no metadata as of now).
INDEX
=====
i. Version updation protocol
ii. Correctness guarantees
iii. Implementation
iv. Protocol enhancements
i. Version updation protocol
============================
There are two types of versions associated with an object:
a) Ongoing version: This version is incremented on first open() [when
the in-memory representation of the object (inode) is marked dirty]
and synchronized to disk. When an object is created, a default ongoing
version of one (1) is assigned. An object lookup() also assigns the
default version if none is present. When a version is initialized upon
a lookup() or creat() FOP, it need not be durable on disk and therefore
can just be an extended attribute set without an expensive fsync()
syscall.
b) Signing version: This is the version against which an object is deemed
to be signed. An object's signature is tied to a particular signed version.
Since an object is a candidate for signing upon last release() [last
close()], the signing version is the "ongoing version" at that point in time.
An object's signature is trustworthy when the version it was signed against
matches the ongoing version, i.e., if the hash is calculated by hand and
compared against the object signature, it *should* be a perfect match if
and only if the versions are equal. Otherwise, the signature is
considered stale (it might or might not match the hash just calculated).
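The trust rule can be written down as a single comparison. The sketch
below is only an illustration: the obj_versions structure and the
signature_is_trustworthy() helper are hypothetical names, not taken
from the GlusterFS sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of the two versions attached to an object. */
struct obj_versions {
    unsigned long ongoing;   /* ongoing version */
    unsigned long signing;   /* version the stored signature was made against */
};

/* A stored signature is trusted only if it was made against the
 * current ongoing version; otherwise it is considered stale. */
static bool signature_is_trustworthy(const struct obj_versions *v)
{
    assert(v != NULL);
    return v->signing == v->ongoing;
}
```

An object with the default versions (ongoing 1, signing 0) is therefore
never reported as trustworthy until it has been signed.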
Initialization of object versions
---------------------------------
An object that existed before versioning was introduced is assigned the
default versions upon lookup(). The protocol at this point expects "no"
durability guarantees for the versions, i.e., extended attribute sets
need not be followed by an explicit filesystem sync (fsync()). In case
of a power outage or a crash, versions are re-initialized with defaults
if found to be non-existent. The signing version is initialized with a
default value of zero (0) and the ongoing version with one (1).
[
NOTE: If an object already has versions on-disk, lookup() just brings
the versions in memory. In this case both versions may or may
not match depending on the state the object was left in.
]
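A minimal sketch of this initialization step follows. It models the
on-disk extended attributes as a plain structure with a "present" flag;
the structure names, the macros and lookup_versions() are assumptions
made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEFAULT_ONGOING_VERSION 1UL
#define DEFAULT_SIGNING_VERSION 0UL

/* Hypothetical model of the version xattrs stored on disk. */
struct disk_versions {
    bool present;            /* false if the xattrs do not exist */
    unsigned long ongoing;
    unsigned long signing;
};

/* Hypothetical in-memory (inode context) copy of the versions. */
struct mem_versions {
    unsigned long ongoing;
    unsigned long signing;
    bool dirty;
};

/* lookup(): assign defaults if no versions exist, then load them. */
static void lookup_versions(struct disk_versions *disk,
                            struct mem_versions *mem)
{
    assert(disk != NULL && mem != NULL);

    if (!disk->present) {
        /* Plain xattr set; no fsync() is required at this point. */
        disk->ongoing = DEFAULT_ONGOING_VERSION;
        disk->signing = DEFAULT_SIGNING_VERSION;
        disk->present = true;
    }

    /* Existing on-disk versions are simply brought into memory. */
    mem->ongoing = disk->ongoing;
    mem->signing = disk->signing;
    mem->dirty = true;   /* the next open() must sync a new version */
}
```

If the defaults are lost in a crash, the next lookup() takes the same
path and assigns them again, which is why no durability is needed here.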
Increment of object versions
----------------------------
During initial versioning, the in-memory representation of the object is
marked dirty, so that the next modification operation on the object
triggers a version synchronization to disk (extended attribute set).
Moreover, this operation needs to be durable on disk for the protocol
to be crash consistent.
Let's illustrate the various version states after subsequent open()s.
Not all modification operations need to increment the ongoing version;
only the first operation needs to (subsequent operations are NO-OPs).
NOTE: From here on, "[s]" depicts a durable filesystem operation and
"*" depicts the inode as dirty. OV(m) is the in-memory ongoing
version, OV(d) the on-disk ongoing version and SV(d) the on-disk
signing version.
          lookup()     open()       open()       open()
===========================================================
OV(m):    1*           2            2            2
-----------------------------------------
OV(d):    1            2[s]         2            2
SV(d):    0            0            0            0
Let's now illustrate the state when an already signed object undergoes
file operations.
on-disk state:
OV(d): 3
SV(d): 3|<signature>
          lookup()     open()       open()       open()
===========================================================
OV(m):    3*           4            4            4
-----------------------------------------
OV(d):    3            4[s]         4            4
SV(d):    3            3            3            3
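The state transitions in the two tables above can be reproduced with a
few lines of C. This is a sketch under stated assumptions: the
inode_versions structure, the durable_syncs counter (which only counts
"[s]" operations for illustration) and on_open() are hypothetical, and
the durable extended attribute write is modelled as a plain assignment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-inode version state. */
struct inode_versions {
    unsigned long ov_mem;        /* OV(m): in-memory ongoing version */
    unsigned long ov_disk;       /* OV(d): on-disk ongoing version */
    unsigned long sv_disk;       /* SV(d): on-disk signing version */
    bool dirty;                  /* "*" in the tables */
    unsigned int durable_syncs;  /* number of "[s]" operations performed */
};

/* open(): only the first open() on a dirty inode bumps the version. */
static void on_open(struct inode_versions *v)
{
    assert(v != NULL);

    if (!v->dirty)
        return;                  /* subsequent open()s are NO-OPs */

    /* Durable write of the new version to disk ("[s]"). */
    v->ov_disk = v->ov_mem + 1;
    v->durable_syncs++;

    /* Memory is updated only after the on-disk write is done. */
    v->ov_mem = v->ov_disk;
    v->dirty = false;
}
```

Starting from the signed object (OV 3, SV 3, dirty after lookup()),
three on_open() calls leave OV(m) and OV(d) at 4 with a single durable
synchronization, while SV(d) stays at 3.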
Signing process
---------------
As per the above example, when the last open file descriptor is closed,
signing needs to be performed. The protocol requires the signature
to be attached to a version, which in this case is the in-memory
value of the ongoing version. A release() also marks the inode dirty;
therefore, the next open() does a durable version synchronization to
disk.
[carrying forward the versions from the earlier example]
          close()      release()    open()       open()
===========================================================
OV(m):    4            4*           5            5
-----------------------------------------
OV(d):    4            4            5[s]         5
SV(d):    3            3            3            3
As shown above, a release() call triggers a signing with the signing version
set to OV(m), which in this case is 4. During signing, the object is signed
with a signature attached to version 4 as shown below (continuing with
the last open() call from above):
          open()       sign(4, signature)
===========================================================
OV(m):    5            5
-----------------------------------------
OV(d):    5            5
SV(d):    3            4:<signature>[s]
A signature comparison at this point in time is untrustworthy due to
the version mismatch [SV(d) is 4 while OV(d) is 5]. This also protects
against node crashes and hard reboots, owing to the durability
guarantee of the on-disk version on first open().
          close()      release()    open()
===========================================================
OV(m):    4            4*           5
-------------------------------- CRASH
OV(d):    4            4            5[s]
SV(d):    3            3            3
The protocol is immune to signing requests after crashes due to
the version synchronization performed on first open(). A signing
request for a version lower than the *current* ongoing version
can be ignored. It is left to the implementation to either
accept or ignore such signing request(s).
[
NOTE: Inode forget() causes a fresh lookup() to be triggered.
Since a forget() call is received when there are no
active references for an inode, the on-disk version is
the latest and would be copied in-memory on lookup().
]
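The release() and sign() steps of this section can be sketched as
follows. All names are hypothetical, durable writes are modelled as
plain assignments, and the ignore_stale flag is an assumption that
stands for the implementation's choice between accepting and ignoring
a signing request for an older version.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-inode version state. */
struct inode_versions {
    unsigned long ov_mem;    /* OV(m) */
    unsigned long ov_disk;   /* OV(d) */
    unsigned long sv_disk;   /* SV(d) */
    bool dirty;
};

/* First open() on a dirty inode: durable sync of OV(m) + 1. */
static void on_open(struct inode_versions *v)
{
    assert(v != NULL);
    if (!v->dirty)
        return;
    v->ov_disk = v->ov_mem + 1;
    v->ov_mem = v->ov_disk;
    v->dirty = false;
}

/* Last close(): mark the inode dirty and return the version that the
 * signature has to be attached to. */
static unsigned long on_release(struct inode_versions *v)
{
    assert(v != NULL);
    v->dirty = true;
    return v->ov_mem;
}

/* Signing request; returns true if the signature was stored. */
static bool on_sign(struct inode_versions *v, unsigned long version,
                    bool ignore_stale)
{
    assert(v != NULL);
    if (ignore_stale && version < v->ov_mem)
        return false;        /* object was reopened since release() */
    v->sv_disk = version;    /* stands in for the durable signature write */
    return true;
}
```

With the accepting policy the object ends up with SV(d) = 4 and
OV(d) = 5, as in the table above, so the signature is stored but not
trusted. With the ignoring policy SV(d) stays at 3. Either way no stale
signature is ever treated as valid.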
ii. Correctness Guarantees
==========================
Concurrent open()'s
-------------------
When an inode is dirty (i.e., the very next operation would try to
synchronize the version to disk), there can be multiple calls [say,
open()] that find the inode state as dirty and try to write back
the new version to disk. Also note that marking the inode as synced
and updating the in-memory version is done *after* the new version
is written to disk. This avoids a mismatch where the version
synchronization fails (leaving the old version on disk) while the
in-memory version already holds the updated value.
Coming back to multiple open() calls on an object, each open() call
tries to synchronize the new version to disk if the inode is marked
as dirty. This is safe, as each open() tries to synchronize the same
new version (ongoing version + 1) even if the updates are concurrent.
The in-memory version is finally updated to reflect the new
version and the inode is marked non-dirty. Again, this is done *only* if
the inode is still dirty; open() calls that updated the on-disk
version but lost the race to update the in-memory version
are therefore NO-OPs.
on-disk state:
OV(d): 3
SV(d): 3|<signature>
          lookup()     open()       open()'      open()'      open()
====================================================================
OV(m):    3*           3*           3*           4            NO-OP
--------------------------------------------------
OV(d):    3            4[s]         4[s]         4            4
SV(d):    3            3            3            3            3
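The interleaving in the table above can be replayed deterministically
by splitting open() into its two phases. This is a single-threaded
sketch, not a concurrent implementation: the names are hypothetical,
and a real implementation would have to run each phase under the
inode's lock. Since versions start at 1, the value 0 is used here to
mean "inode was not dirty".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-inode version state. */
struct inode_versions {
    unsigned long ov_mem;    /* OV(m) */
    unsigned long ov_disk;   /* OV(d) */
    bool dirty;
};

/* Phase 1: if the inode is dirty, write OV(m) + 1 to disk.
 * Returns the version written, or 0 if the inode was not dirty. */
static unsigned long open_sync_to_disk(struct inode_versions *v)
{
    assert(v != NULL);
    if (!v->dirty)
        return 0;
    unsigned long next = v->ov_mem + 1;
    v->ov_disk = next;       /* stands in for the durable xattr write */
    return next;
}

/* Phase 2: publish the written version in memory, but only if the
 * inode is still dirty. Returns true if this caller won the race. */
static bool open_publish(struct inode_versions *v, unsigned long written)
{
    assert(v != NULL);
    if (written == 0 || !v->dirty)
        return false;        /* lost the race: NO-OP */
    v->ov_mem = written;
    v->dirty = false;
    return true;
}
```

Two opens that both run phase 1 before either runs phase 2 write the
same value (4) to disk. Only the first phase 2 changes the in-memory
state and the second is a NO-OP, so the version is bumped exactly once.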
open()/release() race
---------------------
This race can cause a release() [on last close()] to pick up the
ongoing version which was just incremented by a fresh open(). This
leads to the object being signed with the same version as the
ongoing version even though it is open for modification, so the
signature may not match the hash when it is recalculated.
Another point worth mentioning here is that the open
file descriptor is *attached* to its inode *after* the
version synchronization (and increment) is done. Hence, if a release()
sneaks into this window, the file descriptor list for the given
inode is still empty, and release() therefore considers it to be the
last close().
To counter this, the protocol should track the open and release
counts for file descriptors. A release() should only trigger a
signing request when the file descriptor list for an inode is empty
and the number of releases matches the number of opens. When an
open() sneaks in and increments the ongoing version but the file
descriptor is still not attached to the inode, the open and release
counts mismatch, thereby identifying an open() in progress.
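The counting rule can be sketched as below. The inode_ctx structure and
the three helpers are hypothetical, and locking is left out; the point
is only to show how an open() that has started but not yet attached its
file descriptor keeps a concurrent release() from triggering signing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-inode counters. */
struct inode_ctx {
    unsigned long opens;         /* open() calls that have started */
    unsigned long releases;      /* release() calls seen so far */
    unsigned long attached_fds;  /* fds currently attached to the inode */
};

/* Start of open(): counted before the version synchronization. */
static void open_begin(struct inode_ctx *ctx)
{
    assert(ctx != NULL);
    ctx->opens++;
}

/* End of open(): the fd is attached after the version sync is done. */
static void open_attach_fd(struct inode_ctx *ctx)
{
    assert(ctx != NULL);
    ctx->attached_fds++;
}

/* release(): returns true if a signing request should be triggered. */
static bool release_fd(struct inode_ctx *ctx)
{
    assert(ctx != NULL && ctx->attached_fds > 0);
    ctx->attached_fds--;
    ctx->releases++;
    return ctx->attached_fds == 0 && ctx->releases == ctx->opens;
}
```

If fd A is open and a second open() has begun but not yet attached its
fd, releasing A sees an empty fd list but one release against two
opens, so no signing is triggered. Signing is requested only when the
second fd is released as well.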
iii. Implementation
===================
Refer to: xlators/features/bit-rot/src/stub
iv. Protocol enhancements
=========================
a) Delaying persisting on-disk versions till open()
b) Lazy version updation (until signing?)
c) Protocol changes required to handle anonymous file
descriptors in GlusterFS.