doc: document bit-rot feature

Change-Id: Ibad640d01975906b7642c76a1649e3e272f3a8bc
BUG: 1170075
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Reviewed-on: http://review.gluster.org/9712
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
parent 866c64ba5e
commit cd3d34289c
8
doc/features/bit-rot/00-INDEX
Normal file
@@ -0,0 +1,8 @@
00-INDEX
        - this file
bitrot-docs.txt
        - links to design, spec and feature page
object-versioning.txt
        - object versioning mechanism to track object signature
memory-usage.txt
        - memory usage during object expiry tracking
5
doc/features/bit-rot/bitrot-docs.txt
Normal file
@@ -0,0 +1,5 @@
* Feature page: http://www.gluster.org/community/documentation/index.php/Features/BitRot

* Design: http://goo.gl/Mjy4mD

* CLI specification: http://goo.gl/2o12Fn
48
doc/features/bit-rot/memory-usage.txt
Normal file
@@ -0,0 +1,48 @@
object expiry tracking memory usage
===================================

The bitrot daemon tracks objects for expiry in a data structure known
as a "timer-wheel"; an object is signed once it expires. The timer-wheel
is a well-known data structure for tracking the expiry of millions of
objects. Let's look at the memory usage involved when tracking 1 million
objects (per brick).

The bitrot daemon uses the "br_object" structure to hold the information
needed for signing. An instance of this structure is allocated
for each object that needs to be signed.

struct br_object {
        xlator_t *this;

        br_child_t *child;

        void *data;
        uuid_t gfid;
        unsigned long signedversion;

        struct list_head list;
};

The timer-wheel requires an instance of the structure below per
object that needs to be tracked for expiry.

struct gf_tw_timer_list {
        void *data;
        unsigned long expires;

        /** callback routine */
        void (*function)(struct gf_tw_timer_list *, void *, unsigned long);

        struct list_head entry;
};

Structure sizes (on a 64-bit build):
        sizeof (struct br_object): 64 bytes
        sizeof (struct gf_tw_timer_list): 40 bytes

Together, these structures take up 104 bytes per object. To track all 1 million
objects at the same time, the amount of memory taken up would be:

        1,000,000 * 104 bytes = 104,000,000 bytes (~100MB)

Not so bad, I think.
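
The arithmetic above can be checked with a small C sketch. The typedefs for
xlator_t and br_child_t, the 16-byte gfid type and struct list_head below are
stand-ins (assumptions) that only mimic the sizes of the real GlusterFS types
on a typical 64-bit Linux build; the helper names per_object_bytes() and
tracking_bytes() are hypothetical and not part of the bitrot daemon.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types (assumptions): only their sizes matter for this sketch. */
typedef struct xlator   xlator_t;        /* opaque, used via pointer only */
typedef struct br_child br_child_t;      /* opaque, used via pointer only */
typedef unsigned char   sketch_uuid_t[16]; /* stands in for the 16-byte uuid_t */

struct list_head {
        struct list_head *next;
        struct list_head *prev;
};

struct br_object {
        xlator_t *this;
        br_child_t *child;
        void *data;
        sketch_uuid_t gfid;
        unsigned long signedversion;
        struct list_head list;
};

struct gf_tw_timer_list {
        void *data;
        unsigned long expires;
        void (*function)(struct gf_tw_timer_list *, void *, unsigned long);
        struct list_head entry;
};

/* Bytes needed to track one object: signing information plus its timer. */
static size_t per_object_bytes(void)
{
        return sizeof(struct br_object) + sizeof(struct gf_tw_timer_list);
}

/* Bytes needed to track nr_objects objects at the same time. */
static size_t tracking_bytes(size_t nr_objects)
{
        return nr_objects * per_object_bytes();
}
```

With 8-byte pointers and an 8-byte unsigned long, tracking_bytes(1000000)
evaluates to 104,000,000 bytes, which is roughly 99 MiB and agrees with the
~100MB estimate. Allocator overhead per allocation is not counted here.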
236
doc/features/bit-rot/object-versioning.txt
Normal file
@@ -0,0 +1,236 @@
Object versioning
=================

Bitrot detection in GlusterFS relies on object (file) checksum (hash) verification,
also known as the "object signature". An object is signed when there are no active
file descriptors referring to its inode (i.e., upon the last close()). This is just a
hint for initiating the hash calculation (and therefore signing). There is
absolutely no control over when clients can initiate modification operations on
the object. An object could be under modification while its hash computation is
in progress. It would also be inappropriate to restrict access to such objects
for the duration of signing.

Object versioning is used as a mechanism to identify the staleness of an object's
signature. This document does not just list the version update protocol,
but also goes through the various factors that led to its design.

NOTE: The word "object" is used to represent a "regular file" (in the Linux sense), and
      object versions are persisted in extended attributes of the object's inode.
      Signature calculation includes the object's data (no metadata as of now).

INDEX
=====
  i.   Version updation protocol
  ii.  Correctness guarantees
  iii. Implementation
  iv.  Protocol enhancements

i. Version updation protocol
============================
There are two types of versions associated with an object:

   a) Ongoing version: This version is incremented on the first open() [when
      the in-memory representation of the object (inode) is marked dirty]
      and synchronized to disk. When an object is created, a default ongoing
      version of one (1) is assigned. An object lookup() also assigns the
      default version if one is not present. When a version is initialized upon
      a lookup() or creat() FOP, it need not be durable on disk and therefore
      can just be an extended attribute set without an expensive fsync()
      syscall.

   b) Signing version: This is the version against which an object is deemed
      to be signed. An object's signature is tied to a particular signing version.
      Since an object is a candidate for signing upon the last release() [last
      close()], the signing version is the "ongoing version" at that point in time.

An object's signature is trustable when the version it was signed against
matches the ongoing version, i.e., if the hash is calculated by hand and
compared against the object signature, it *should* be a perfect match if
and only if the versions are equal. Otherwise, the signature is
considered stale (it might or might not match the hash just calculated).
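
The trust rule above reduces to a version comparison. The following is a
minimal sketch, assuming the two versions have already been read from the
object's extended attributes; struct obj_versions and
signature_is_trustable() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pair of versions read from an object's extended attributes. */
struct obj_versions {
        unsigned long ongoing;   /* ongoing version */
        unsigned long signing;   /* version the stored signature is tied to */
};

/* The stored signature is worth comparing only when it was made against
 * the current ongoing version; otherwise it is stale. */
static bool signature_is_trustable(const struct obj_versions *v)
{
        assert(v != NULL);
        return v->signing == v->ongoing;
}
```

A scrubber built on this rule would skip (or defer) objects for which the
function returns false instead of flagging them as corrupted.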

Initialization of object versions
---------------------------------
An object that existed before versioning was introduced is assigned the
default versions upon lookup(). The protocol at this point expects "no"
durability guarantees for the versions, i.e., extended attribute sets
need not be followed by an explicit filesystem sync (fsync()). In case
of a power outage or a crash, versions are re-initialized with defaults
if found to be non-existent. The signing version is initialized with a
default value of zero (0) and the ongoing version with one (1).

[
  NOTE: If an object already has versions on-disk, lookup() just brings
        the versions into memory. In this case the two versions may or may
        not match, depending on the state the object was left in.
]
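
The initialization rules can be sketched as follows. This is an in-memory
model only: struct disk_versions stands in for the extended attributes,
struct mem_versions for the inode context, and versions_lookup() is a
hypothetical helper, not the stub translator's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEFAULT_SIGNING_VERSION 0UL
#define DEFAULT_ONGOING_VERSION 1UL

/* Hypothetical on-disk state; "present" is false when the extended
 * attributes are missing (pre-versioning object, or lost in a crash). */
struct disk_versions {
        bool present;
        unsigned long ongoing;
        unsigned long signing;
};

/* Hypothetical in-memory (inode) state. */
struct mem_versions {
        unsigned long ongoing;
        unsigned long signing;
        bool dirty;
};

/* lookup(): bring versions into memory, writing defaults when absent.
 * The default write is deliberately NOT followed by an fsync(). */
static void versions_lookup(struct disk_versions *d, struct mem_versions *m)
{
        assert(d != NULL && m != NULL);
        if (!d->present) {
                d->ongoing = DEFAULT_ONGOING_VERSION;
                d->signing = DEFAULT_SIGNING_VERSION;
                d->present = true;      /* non-durable xattr set */
        }
        m->ongoing = d->ongoing;
        m->signing = d->signing;
        m->dirty = true;                /* next open() must sync durably */
}
```

Losing the default versions in a crash is harmless in this model: the next
lookup() finds them absent and writes the same defaults again.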


Increment of object versions
----------------------------
During initial versioning, the in-memory representation of the object is
marked dirty, so that the next modification operation on the object
triggers a version synchronization to disk (extended attribute set).
Moreover, this operation needs to be durable on disk for the protocol
to be crash consistent.

Let's picture the version states across subsequent open()s.
Not every modification operation needs to increment the ongoing version;
only the first operation does (subsequent operations are NO-OPs).

NOTE: From here on, "[s]" depicts a durable filesystem operation and
      "*" depicts the inode as dirty. OV(m) and OV(d) are the in-memory
      and on-disk ongoing versions; SV(d) is the on-disk signing version.


            lookup()       open()       open()       open()
===========================================================

OV(m):         1*             2            2            2
-----------------------------------------
OV(d):         1             2[s]          2            2
SV(d):         0              0            0            0
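
The table above can be reproduced with a small state model. This is a
sketch under assumptions: struct versioned_inode, durable_set_ongoing() and
versioned_open() are hypothetical names, and the durable write is simulated
by a counter instead of a real extended attribute set followed by fsync().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the per-object version state. */
struct versioned_inode {
        unsigned long ov_mem;     /* OV(m): in-memory ongoing version */
        unsigned long ov_disk;    /* OV(d): on-disk ongoing version */
        unsigned long sv_disk;    /* SV(d): on-disk signing version */
        bool dirty;               /* "*" in the tables */
        unsigned durable_syncs;   /* number of "[s]" operations so far */
};

/* Stand-in for a durable extended attribute update (set + fsync()). */
static bool durable_set_ongoing(struct versioned_inode *in, unsigned long v)
{
        in->ov_disk = v;
        in->durable_syncs++;
        return true;
}

/* open(): only the first open() on a dirty inode increments the version. */
static bool versioned_open(struct versioned_inode *in)
{
        assert(in != NULL);
        if (!in->dirty)
                return true;                      /* NO-OP */
        unsigned long next = in->ov_mem + 1;
        if (!durable_set_ongoing(in, next))
                return false;                     /* memory stays untouched */
        in->ov_mem = next;                        /* update memory last */
        in->dirty = false;
        return true;
}
```

Starting from the post-lookup() state (OV(m) = 1, dirty), three calls to
versioned_open() leave both ongoing versions at 2 with a single durable sync.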


Let's now picture the state when an already signed object undergoes
file operations.

on-disk state:
    OV(d): 3
    SV(d): 3|<signature>


            lookup()       open()       open()       open()
===========================================================

OV(m):         3*             4            4            4
-----------------------------------------
OV(d):         3             4[s]          4            4
SV(d):         3              3            3            3

Signing process
---------------
As per the above example, when the last open file descriptor is closed,
signing needs to be performed. The protocol requires that the signature
be attached to a version, which in this case is the in-memory
value of the ongoing version. A release() also marks the inode dirty;
therefore, the next open() does a durable version synchronization to
disk.

[carrying forward the versions from the earlier example]

            close()      release()      open()       open()
===========================================================

OV(m):         4             4*            5            5
-----------------------------------------
OV(d):         4             4            5[s]          5
SV(d):         3             3             3            3

As shown above, a release() call triggers signing with the signing version
set to OV(m), which in this case is 4. During signing, the object is signed
with a signature attached to version 4, as shown below (continuing from
the last open() call above):

               open()            sign(4, signature)
===========================================================

OV(m):           5                      5
-----------------------------------------
OV(d):           5                      5
SV(d):           3              4:<signature>[s]
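
The release()/open()/sign() sequence from the two tables above can be
replayed with a minimal model. All names here (struct versioned_inode,
versioned_release(), versioned_open(), versioned_sign()) are hypothetical,
and durable writes are represented by plain assignments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the per-object version state. */
struct versioned_inode {
        unsigned long ov_mem;    /* OV(m) */
        unsigned long ov_disk;   /* OV(d) */
        unsigned long sv_disk;   /* SV(d) */
        bool dirty;              /* "*" */
};

/* release() on last close(): mark the inode dirty and return the version
 * that the signing request is tied to (the in-memory ongoing version). */
static unsigned long versioned_release(struct versioned_inode *in)
{
        assert(in != NULL);
        in->dirty = true;
        return in->ov_mem;
}

/* open(): the first open() on a dirty inode bumps the version durably. */
static void versioned_open(struct versioned_inode *in)
{
        assert(in != NULL);
        if (!in->dirty)
                return;
        in->ov_disk = in->ov_mem + 1;   /* durable write comes first */
        in->ov_mem = in->ov_disk;
        in->dirty = false;
}

/* sign(version, signature): persist the signing version (durably). */
static void versioned_sign(struct versioned_inode *in, unsigned long version)
{
        assert(in != NULL);
        in->sv_disk = version;
}
```

Replaying release(), open(), sign(4) on the state OV = 4, SV = 3 ends with
OV(d) = 5 and SV(d) = 4, so the freshly written signature is already stale.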

A signature comparison at this point in time is untrustworthy due to the
version mismatch. This also protects against node crashes and hard
reboots, thanks to the durability guarantee of the on-disk version on
the first open().

            close()      release()      open()
===========================================================

OV(m):         4             4*            5
--------------------------------  CRASH
OV(d):         4             4            5[s]
SV(d):         3             3             3

The protocol is immune to signing requests that arrive after a crash, due
to the version synchronization performed on the first open(). A signing
request for a version lower than the *current* ongoing version
can be ignored. It is left to the implementation to either
accept or ignore such signing request(s).
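
One possible policy for such requests is sketched below. It assumes the
implementation chooses to ignore stale requests (the protocol also allows
accepting them); enum sign_action and classify_sign_request() are
hypothetical names.

```c
#include <assert.h>

/* Hypothetical outcome of examining a signing request. */
enum sign_action { SIGN_ACCEPT, SIGN_IGNORE };

/* A request tied to a version lower than the on-disk ongoing version would
 * produce a signature that is stale on arrival, so this sketch drops it. */
static enum sign_action classify_sign_request(unsigned long requested_version,
                                              unsigned long ongoing_on_disk)
{
        return requested_version < ongoing_on_disk ? SIGN_IGNORE : SIGN_ACCEPT;
}
```

In the crash example above, a replayed request for version 4 meets an
on-disk ongoing version of 5 and is ignored under this policy.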

[
  NOTE: Inode forget() causes a fresh lookup() to be triggered.
        Since a forget() call is received when there are no
        active references to an inode, the on-disk version is
        the latest and would be copied into memory on lookup().
]

ii. Correctness Guarantees
==========================

Concurrent open()'s
-------------------
When an inode is dirty (i.e., the very next operation would try to
synchronize the version to disk), there can be multiple calls [say,
open()] that find the inode state as dirty and try to write back
the new version to disk. Also note that marking the inode as synced
and updating the in-memory version is done *after* the new version
is written to disk. This is done to avoid a stale version remaining
on disk while the in-memory version already holds the updated value,
in case the version synchronization fails.
Coming back to multiple open() calls on an object: each open() call
tries to synchronize the new version to disk if the inode is marked
as dirty. This is safe because each open() tries to synchronize the
same new version (ongoing version + 1), even if the updates are concurrent.
The in-memory version is finally updated to reflect the new
version and the inode is marked non-dirty. Again, this is done *only* if
the inode is still dirty, so open() calls that updated the on-disk
version but lost the race to update the in-memory version
are NO-OPs.

on-disk state:
    OV(d): 3
    SV(d): 3|<signature>


            lookup()     open()     open()'     open()'     open()
=============================================================

OV(m):         3*          3*         3*           4         NO-OP
--------------------------------------------------
OV(d):         3          4[s]       4[s]          4           4
SV(d):         3           3          3            3           3
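
The interleaving in the table above can be modelled by splitting open() into
a writeback step and a commit step. This is a single-threaded sketch that
replays one interleaving by hand; a real implementation would guard the
in-memory fields with a lock. open_writeback() and open_commit() are
hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the per-object version state. */
struct versioned_inode {
        unsigned long ov_mem;     /* OV(m) */
        unsigned long ov_disk;    /* OV(d) */
        bool dirty;               /* "*" */
        unsigned durable_syncs;   /* number of "[s]" operations */
};

/* Step 1: if the inode is dirty, durably write OV(m) + 1 to disk.
 * Returns the version written, or 0 when no sync was needed
 * (0 is never a valid ongoing version, since versions start at 1). */
static unsigned long open_writeback(struct versioned_inode *in)
{
        assert(in != NULL);
        if (!in->dirty)
                return 0;
        unsigned long next = in->ov_mem + 1;
        in->ov_disk = next;
        in->durable_syncs++;
        return next;
}

/* Step 2: update the in-memory version, but only if the inode is still
 * dirty. Returns true for the caller that won the race. */
static bool open_commit(struct versioned_inode *in, unsigned long written)
{
        assert(in != NULL);
        if (written == 0 || !in->dirty)
                return false;             /* lost the race: NO-OP */
        in->ov_mem = written;
        in->dirty = false;
        return true;
}
```

Both racing callers compute the same value (OV(m) + 1), so the duplicate
durable write is redundant but harmless, and only one commit changes OV(m).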


open()/release() race
---------------------
This race can cause a release() [on last close()] to pick up the
ongoing version that was just incremented by a fresh open(). This
leads to signing of the object with the same version as the
ongoing version, which can result in mismatching signatures when the
hash is recalculated.
Another point worth mentioning here is that the open
file descriptor is *attached* to its inode *after* it has done
version synchronization (and increment). Hence, if a release()
sneaks into this window, the file descriptor list for the given
inode is still empty, and release() therefore considers it the
last close().
To counter this, the protocol should track the open and release
counts for file descriptors. A release() should only trigger a
signing request when the file descriptor list for an inode is empty
and the number of releases matches the number of opens. When an
open() sneaks in and increments the ongoing version but the file
descriptor is still not attached to the inode, the open and release
counts mismatch, thereby identifying an open() in progress.
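
The counting rule can be sketched as below. This is an illustrative model
under assumptions: struct fd_tracking and the helpers open_begin(),
open_attach() and release_should_sign() are hypothetical, and locking is
omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-inode bookkeeping for the open()/release() race. */
struct fd_tracking {
        unsigned long opens;         /* open()s that have started */
        unsigned long releases;      /* release()s seen so far */
        unsigned long attached_fds;  /* descriptors attached to the inode */
};

/* Called when open() starts, before version synchronization. */
static void open_begin(struct fd_tracking *t)
{
        assert(t != NULL);
        t->opens++;
}

/* Called once open() has synchronized the version and attaches the fd. */
static void open_attach(struct fd_tracking *t)
{
        assert(t != NULL);
        t->attached_fds++;
}

/* release(): sign only if no fd remains attached AND no open() is in
 * flight, i.e. the release count has caught up with the open count. */
static bool release_should_sign(struct fd_tracking *t)
{
        assert(t != NULL && t->attached_fds > 0);
        t->attached_fds--;
        t->releases++;
        return t->attached_fds == 0 && t->releases == t->opens;
}
```

If a second open() has begun but not yet attached when the first descriptor
is released, the counts differ (2 opens, 1 release) and no signing request
is issued; the later release of the second descriptor triggers it.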


iii. Implementation
===================
Refer to: xlators/features/bit-rot/src/stub

iv. Protocol enhancements
=========================

   a) Delaying persisting on-disk versions till open()
   b) Lazy version updation (until signing?)
   c) Protocol changes required to handle anonymous file
      descriptors in GlusterFS.