doc: document bit-rot feature

Change-Id: Ibad640d01975906b7642c76a1649e3e272f3a8bc BUG: 1170075 Signed-off-by: Venky Shankar <vshankar@redhat.com> Reviewed-on: http://review.gluster.org/9712 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
2015-02-18 17:01:21 +05:30 · 2015-02-18 17:01:21 +05:30 · cd3d34289c
commit cd3d34289c
parent 866c64ba5e
4 changed files with 297 additions and 0 deletions
--- a/doc/features/bit-rot/00-INDEX
+++ b/doc/features/bit-rot/00-INDEX
@ -0,0 +1,8 @@
+00-INDEX
+        - this file
+bitrot-docs.txt
+        - links to design, spec and feature page
+object-versioning.txt
+        - object versioning mechanism to track object signature
+memory-usage.txt
+        - memory usage during object expiry tracking
--- a/doc/features/bit-rot/bitrot-docs.txt
+++ b/doc/features/bit-rot/bitrot-docs.txt
@ -0,0 +1,5 @@
+* Feature page: http://www.gluster.org/community/documentation/index.php/Features/BitRot
+
+* Design: http://goo.gl/Mjy4mD
+
+* CLI specification: http://goo.gl/2o12Fn
--- a/doc/features/bit-rot/memory-usage.txt
+++ b/doc/features/bit-rot/memory-usage.txt
@ -0,0 +1,48 @@
+object expiry tracking memroy usage
+====================================
+
+Bitrot daemon tracks objects for expiry in a data structure known
+as "timer-wheel" (after which the object is signed). It's a well
+known data structure for tracking million of objects of expiry.
+Let's see the memory usage involved when tracking 1 million
+objects (per brick).
+
+Bitrot daemon uses "br_object" structure to hold information
+needed for signing. An instance of this structure is allocated
+for each object that needs to be signed.
+
+struct br_object {
+        xlator_t *this;
+
+        br_child_t *child;
+
+        void *data;
+        uuid_t gfid;
+        unsigned long signedversion;
+
+        struct list_head list;
+};
+
+Timer-wheel requires an instance of the structure below per
+object that needs to be tracked for expiry.
+
+struct gf_tw_timer_list {
+        void *data;
+        unsigned long expires;
+
+        /** callback routine */
+        void (*function)(struct gf_tw_timer_list *, void *, unsigned long);
+
+        struct list_head entry;
+};
+
+Structure sizes:
+  sizeof (struct br_object): 64 bytes
+  sizeof (struct gf_tw_timer_list): 40 bytes
+
+Together, these structures take up 104 bytes. To track all 1 million objects
+at the same time, the amount of memory taken up would be:
+
+  1,000,000 * 104 bytes: ~100MB
+
+Not so bad, I think.
--- a/doc/features/bit-rot/object-versioning.txt
+++ b/doc/features/bit-rot/object-versioning.txt
@ -0,0 +1,236 @@
+Object versioning
+=================
+
+  Bitrot detection in GlusterFS relies on object (file) checksum (hash) verification,
+  also known as "object signature". An object is signed when there are no active
+  file desciptors referring to it's inode (i.e., upon last close()). This is just an
+  hint for the initiation of hash calculation (and therefore signing). There is
+  absolutely no control over when clients can initiate modification operations on
+  the object. An object could be under modification while it's hash computation is
+  under progress. It would also be in-appropriate to restrict access to such objects
+  during the time duration of signing.
+
+  Object versioning is used as a mechanism to identify the staleness of an objects
+  signature. The document below does not just list down the version update protocol,
+  but goes through various factors that led to its design.
+
+NOTE: The word "object" is used to represent a "regular file" (in linux sense) and
+      object versions are persisted in extended attributes of the object's inode.
+      Signature calculation includes object's data (no metadata as of now).
+
+INDEX
+=====
+  i.   Version updation protocol
+  ii.  Correctness guaraantees
+  iii. Implementation
+  iv.  Protocol enhancements
+
+i. Version updation protocol
+============================
+  There are two types of versions associated with an object:
+
+  a) Ongoing version: This version is incremented on first open() [when
+     the in-memory representation of the object (inode) is marked dirty
+     and synchronized to disk. When an object is created, a default ongoing
+     version of one (1) is assigned. An object lookup() too assigns the
+     default version if not present. When a version is initialized upon
+     lookup() or creat() FOP, it need to be durable on disk and therefore
+     can just be a extended attrbute set with out an expensive fsync()
+     syscall.
+
+  b) Signing version: This is the version against which an object is deemed
+     to be signed. An objects signature is tied to a particular signed version.
+     Since, an object is a candidate for signing upon last release() [last
+     close()], signing version is the "ongoing version" at that point of time
+
+  An object's signature is trustable when the version it was signed against
+  matches the ongoing version, i.e., if the hash is calculated by hand and
+  compared against the object signature, it *should* be a perfect match if
+  and only if the versions are equal. On the other hand, the signature is
+  considered stale (might or might not match the hash just calculated).
+
+  Initialization of object versions
+  ---------------------------------
+     An object that existed before the pre versioning days, is assigned the
+     default versions upon lookup(). The protocol at this point expects "no"
+     durability guarantess of the versions, i.e., extended attribute sets
+     need not be followed by an explicit filesystem sync (fsync()). In case
+     of a power outage or a crash, versions are re-initialized with defaults
+     if found to be non-existant. The signing version is initialized with a
+     deafault value of zero (0) and the ongoing version as one (1).
+
+     [
+       NOTE: If an object already has versions on-disk, lookup() just brings
+             the versions in memory. In this case both versions may or may
+             not match depending on state the object was left in.
+     ]
+
+
+  Increment of object versions
+  ----------------------------
+     During initial versioning, the in-memory representation of the object is
+     marked dirty, so that subsequent modification operations on the object
+     triggers a versiong synchronization to disk (extended attribute set).
+     Moreover, this operation needs to be durable on disk, for the protocol
+     to be crash consistent.
+
+     Let's picturize the various version states after subsequent open()s.
+     Not all modification operations need to increment the ongoing version,
+     only the first operations needs to (subsequent operations are NO-OPs).
+
+     NOTE: From here one "[s]" depicts a durable filesystem operation and
+           "*" depicts the inode as dirty.
+
+
+                       lookup()     open()    open()    open()
+            ===========================================================
+
+            OV(m):        1*          2         2         2
+                      -----------------------------------------
+            OV(d):        1           2[s]      2         2
+            SV(d):        0           0         0         0
+
+
+     Let's now picturize the state when an already signed object undergoes
+     file operations.
+
+     on-disk state:
+          OV(d): 3
+          SV(d): 3|<signature>
+
+
+                       lookup()     open()    open()    open()
+            ===========================================================
+
+            OV(m):        3*          4         4         4
+                      -----------------------------------------
+            OV(d):        3           4[s]      4         4
+            SV(d):        3           3         3         3
+
+  Signing process
+  ---------------
+     As per the above example, when the last open file descriptor is closed,
+     signing needs to be performed. The protocol restricts that the signing
+     needs to be attached to a version, which in this case is the in-memory
+     value of the ongoing version. A release() also marks the inode dirty,
+     therefore, the next open() does a durable version synchronization to
+     disk.
+
+     [carry forwarding the versions from earlier example]
+
+                       close()     release()  open()   open()
+            ===========================================================
+
+            OV(m):        4           4*        5         5
+                      -----------------------------------------
+            OV(d):        4           4         5[s]      5
+            SV(d):        3           3         3         3
+
+     As shown above, a relase() call triggers a signing with signing version
+     as OV(m): which in this case is 4. During signing, the object is signed
+     with a signature attached to version 4 as shown below (continuing with
+     the last open() call from above):
+
+                       open()           sign(4, signature)
+            ===========================================================
+
+            OV(m):        5                     5
+                      -----------------------------------------
+            OV(d):        5                     5
+            SV(d):        3               4:<signature>[s]
+
+     A signature comparison at this point of time is un-trustable due to
+     version mismatches. This also protects from node crashes and hard
+     reboots due to durability guarantee of on-disk version on first
+     open().
+
+                       close()     release()  open()
+            ===========================================================
+
+            OV(m):        4           4*        5
+                      --------------------------------  CRASH
+            OV(d):        4           4         5[s]
+            SV(d):        3           3         3
+
+     The protocol is immune to signing request after crashes due to
+     the version synchronization performed on first open(). Signing
+     request for a version lesser than the *current* ongoing version
+     can be ignored. It's left upon the implementation to either
+     accept or ignore such signing request(s).
+
+     [
+        NOTE: Inode forget() causes a fresh lookup() to be trigerred.
+              Since a forget() call is received when there are no
+              active references for an inode, the on-disk version is
+              the latest and would be copied in-memory on lookup().
+     ]
+
+ii. Correctness Guarantees
+==========================
+
+     Concurrent open()'s
+     -------------------
+     When an inode is dirty (i.e., the very next operations would try to
+     synchronize the version to disk), there can be multiple calls [say,
+     open()] that would find the inode state as dirty and try to writeback
+     the new version to disk. Also, note that, marking the inode as synced
+     and updating the in-memory version is done *after* the new version
+     is written on disk. This is done to avoid incorrect version stored
+     on-disk in case the version synchronization fails (but the in-memory
+     version still holding the updated value).
+     Coming back to multiple open() calls on an object, each open() call
+     tries to synchronize the new version to disk if the inode is marked
+     as dirty. This is safe as each open() would try to synchronize the
+     new version (ongoingversion + 1) even if the updation is concurrent.
+     The in-memory version is finally updated to reflect the updated
+     version and mark the inode non-dirty. Again this is done *only* if
+     the inode is dirty, thereby open() calls which updated the on-disk
+     version but lost the race to update the in-memory version result
+     are NO-OPs.
+
+     on-disk state:
+          OV(d): 3
+          SV(d): 3|<signature>
+
+
+                       lookup()     open()    open()'   open()'  open()
+            =============================================================
+
+            OV(m):        3*          3*        3*        4      NO-OP
+                      --------------------------------------------------
+            OV(d):        3           4[s]      4[s]      4        4
+            SV(d):        3           3         3         3        3
+
+
+     open()/release() race
+     ---------------------
+     This race can cause a release() [on last close()] to pick up the
+     ongoing version which was just incremented on fresh open(). This
+     leads to signing of the object with the same version as the
+     ongoing version, thereby, mismatching signatures when calculated.
+     Another point that's worth mentioning here is that the open
+     file descriptor is *attached* to it's inode *after* it's done
+     version synchronization (and increment). Hence, if a release()
+     sneaks in this window, the file desriptor list for the given
+     inode is still empty, therefore release() considering it as a
+     last close().
+     To counter this, the protocol should track the open and release
+     counts for file descriptors. A release() should only trigger a
+     signing request when the file desccriptor for an inode is empty
+     and the numbers of releases match the number of opens. When an
+     open() sneaks and increments the ongoing version but the file
+     descriptor is still not attached to the inode, open and release
+     counts mismatch, hence identifying an open() in progress.
+
+
+iii. Implementation
+===================
+     Refer to: xlators/feature/bit-rot/src/stub
+
+iv. Protocol enhancements
+=========================
+
+     a) Delaying persisting on-disk versions till open()
+     b) Lazy version updation (until signing?)
+     c) Protocol changes required to handle anonymous file
+        descriptors in GlusterFS.