doc: added documentation for dispersed volumes

Change-Id: I8a8368bdbe31af30a239aaf8cc478429e10c3f57
BUG: 1147563
Signed-off-by: Xavier Hernandez <xhernandez@datalab.es>
Reviewed-on: http://review.gluster.org/8885
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>


@@ -52,11 +52,24 @@ start it before attempting to mount it.
and performance is critical. In this release, configuration of
this volume type is supported only for Map Reduce workloads.
- **Dispersed** - Dispersed volumes are based on erasure codes,
providing space-efficient protection against disk or server failures.
They store an encoded fragment of the original file on each brick in
such a way that only a subset of the fragments is needed to recover the
original file. The number of bricks that can be missing without
losing access to data is configured by the administrator at volume
creation time.
- **Distributed Dispersed** - Distributed dispersed volumes distribute
files across dispersed subvolumes. This has the same advantages as
distributed replicated volumes, but uses disperse to store the data
on the bricks.
**To create a new volume**
- Create a new volume :
`# gluster volume create <NEW-VOLNAME> [stripe <COUNT> | replica <COUNT>] [transport tcp | rdma | tcp,rdma] <NEW-BRICK>...`
`# gluster volume create <NEW-VOLNAME> [stripe <COUNT> | replica <COUNT> | disperse] [transport tcp | rdma | tcp,rdma] <NEW-BRICK>...`
For example, to create a volume called test-volume consisting of
server3:/exp3 and server4:/exp4:
@@ -389,6 +402,161 @@ of this volume type is supported only for Map Reduce workloads.
> Use the `force` option at the end of command if you want to create the volume in this case.
##Creating Dispersed Volumes
Dispersed volumes are based on erasure codes. They stripe the encoded data of
files, with some redundancy added, across multiple bricks in the volume. You
can use dispersed volumes to get a configurable level of reliability with
minimum space waste.
**Redundancy**
Each dispersed volume has a redundancy value defined when the volume is
created. This value determines how many bricks can be lost without
interrupting the operation of the volume. It also determines the amount of
usable space of the volume using this formula:
<Usable size> = <Brick size> * (#Bricks - Redundancy)
All bricks of a disperse set should have the same capacity, otherwise, when
the smallest brick becomes full, no additional data will be allowed in the
disperse set.
It's important to note that a configuration with 3 bricks and redundancy 1
will have less usable space (66.7% of the total physical space) than a
configuration with 10 bricks and redundancy 1 (90%). However, the first one
will be safer than the second one (roughly, the probability of failure of
the second configuration is more than 4.5 times bigger than that of the
first one).
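A quick shell sketch deriving these percentages from the usable-size formula
above (the brick counts are the ones from this paragraph):
```
# Usable fraction of raw space = (#Bricks - Redundancy) / #Bricks
for bricks in 3 10; do
    awk -v b="$bricks" -v r=1 \
        'BEGIN { printf "%d bricks, redundancy %d: %.1f%% usable\n", b, r, (b - r) / b * 100 }'
done
# prints:
#   3 bricks, redundancy 1: 66.7% usable
#   10 bricks, redundancy 1: 90.0% usable
```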
For example, a dispersed volume composed of 6 bricks of 4TB and a redundancy
of 2 will remain fully operational even with two bricks inaccessible. However,
a third inaccessible brick will bring the volume down because it won't be
possible to read from or write to it. The usable space of the volume will be
equal to 16TB.
The implementation of erasure codes in GlusterFS limits the redundancy to a
value smaller than #Bricks / 2 (or equivalently, redundancy * 2 < #Bricks).
Having a redundancy equal to half of the number of bricks would be almost
equivalent to a replica-2 volume, and a replicated volume will probably
perform better in that case.
**Optimal volumes**
One of the worst performance penalties of erasure codes is the RMW
(Read-Modify-Write) cycle. Erasure codes operate on blocks of a certain
size and cannot work with smaller ones. This means that if a user issues
a write of a portion of a file that doesn't fill a full block, it needs to
read the remaining portion from the current contents of the file, merge them,
compute the updated encoded block and, finally, write the resulting data.
This adds latency, reducing performance whenever it happens. Some GlusterFS
performance xlators can help to reduce or even eliminate this problem for
some workloads, but it should be taken into account when using dispersed
volumes for a specific use case.
The current implementation of dispersed volumes uses blocks of a size that
depends on the number of bricks and redundancy: 512 * (#Bricks - Redundancy)
bytes. This value is also known as the stripe size.
Using combinations of #Bricks and redundancy that give a power of two for the
stripe size will make the dispersed volume perform better in most workloads,
because applications typically write information in blocks whose sizes are
powers of two (for example databases, virtual machines and many applications).
These combinations are considered *optimal*.
For example, a configuration with 6 bricks and redundancy 2 will have a stripe
size of 512 * (6 - 2) = 2048 bytes, so it's considered optimal. A configuration
with 7 bricks and redundancy 2 would have a stripe size of 2560 bytes, requiring
an RMW cycle for many writes (of course this always depends on the use case).
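A small shell sketch (not part of the official tooling, just an illustration)
that enumerates the combinations these rules consider *optimal*, applying the
constraints redundancy > 0, 2 * redundancy < #Bricks, and a power-of-two
stripe size:
```
#!/bin/bash
# List (bricks, redundancy) pairs whose stripe size 512*(bricks-redundancy)
# is a power of two; these are the *optimal* dispersed configurations.
for bricks in $(seq 3 12); do
    for redundancy in $(seq 1 4); do
        # Redundancy constraint: 2 * redundancy < #Bricks
        if (( 2 * redundancy < bricks )); then
            stripe=$(( 512 * (bricks - redundancy) ))
            # A power of two has a single set bit: stripe & (stripe - 1) == 0
            if (( (stripe & (stripe - 1)) == 0 )); then
                echo "bricks=$bricks redundancy=$redundancy stripe=${stripe} bytes (optimal)"
            fi
        fi
    done
done
```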
**To create a dispersed volume**
1. Create a trusted storage pool.
2. Create the dispersed volume:
`# gluster volume create <volname> [disperse [<count>]] [redundancy <count>] [transport tcp | rdma | tcp,rdma] <brick> ...`
A dispersed volume can be created by specifying the number of bricks in a
disperse set, by specifying the number of redundancy bricks, or both.
If *disperse* is not specified, or the _<count>_ is missing, the
entire volume will be treated as a single disperse set composed of all
bricks enumerated in the command line.
If *redundancy* is not specified, it is computed automatically to be the
optimal value. If no optimal value exists, it's assumed to be '1' and a
warning message is shown:
# gluster volume create test-volume disperse 4 server{1..4}:/bricks/test-volume
There isn't an optimal redundancy value for this configuration. Do you want to create the volume with redundancy 1 ? (y/n)
In all cases where *redundancy* is automatically computed and it's not
equal to '1', a warning message is displayed:
# gluster volume create test-volume disperse 6 server{1..6}:/bricks/test-volume
The optimal redundancy for this configuration is 2. Do you want to create the volume with this value ? (y/n)
_redundancy_ must be greater than 0, and the total number of bricks must
be greater than 2 * _redundancy_. This means that a dispersed volume must
have a minimum of 3 bricks.
If the transport type is not specified, *tcp* is used as the default. You
can also set additional options if required, like in the other volume
types.
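Putting the steps together, a minimal end-to-end sketch; the hostnames
server1 to server6 and the brick path /bricks/disp-vol are hypothetical:
```
# Create an optimal 4+2 dispersed volume (6 bricks, redundancy 2) and start it.
# Assumes the six peers already form a trusted storage pool (step 1).
gluster volume create disp-vol disperse 6 redundancy 2 \
    server{1..6}:/bricks/disp-vol
gluster volume start disp-vol
gluster volume info disp-vol
```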
> **Note**:
> - Make sure you start your volumes before you try to mount them or
> else client operations after the mount will hang.
> - GlusterFS will fail to create a dispersed volume if more than one brick of a disperse set is present on the same peer.
> ```
> # gluster volume create <volname> disperse 3 server1:/brick{1..3}
> volume create: <volname>: failed: Multiple bricks of a replicate volume are present on the same server. This setup is not optimal.
> Do you still want to continue creating the volume? (y/n)
> ```
> Use the `force` option at the end of the command if you want to create the volume in this case.
##Creating Distributed Dispersed Volumes
Distributed dispersed volumes are the equivalent of distributed replicated
volumes, but use dispersed subvolumes instead of replicated ones.
**To create a distributed dispersed volume**
1. Create a trusted storage pool.
2. Create the distributed dispersed volume:
`# gluster volume create <volname> disperse <count> [redundancy <count>] [transport tcp | rdma | tcp,rdma] <brick> ...`
To create a distributed dispersed volume, the *disperse* keyword and
_<count>_ are mandatory, and the number of bricks specified in the
command line must be a multiple of the disperse count.
*redundancy* is exactly the same as in the dispersed volume.
If the transport type is not specified, *tcp* is used as the default. You
can also set additional options if required, like in the other volume
types.
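As an illustration, a minimal sketch (hostnames and brick paths are
hypothetical) that creates two 2+1 disperse subvolumes distributed over six
peers:
```
# 6 bricks with 'disperse 3' form two disperse subvolumes of 3 bricks each;
# files are distributed between the two subvolumes.
gluster volume create dist-disp-vol disperse 3 redundancy 1 \
    server{1..6}:/bricks/dist-disp-vol
gluster volume start dist-disp-vol
```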
> **Note**:
> - Make sure you start your volumes before you try to mount them or
> else client operations after the mount will hang.
> - GlusterFS will fail to create a distributed dispersed volume if more than one brick of a disperse set is present on the same peer.
> ```
> # gluster volume create <volname> disperse 3 server1:/brick{1..6}
> volume create: <volname>: failed: Multiple bricks of a replicate volume are present on the same server. This setup is not optimal.
> Do you still want to continue creating the volume? (y/n)
> ```
> Use the `force` option at the end of the command if you want to create the volume in this case.
##Starting Volumes
You must start your volumes before you try to mount them.
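For example, to start the test-volume used in the examples above:
`# gluster volume start test-volume`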


@@ -36,7 +36,7 @@ The Gluster Console Manager is a command line utility for elastic volume management
\fB\ volume info [all|<VOLNAME>] \fR
Display information about all volumes, or the specified volume.
.TP
\fB\ volume create <NEW-VOLNAME> [stripe <COUNT>] [replica <COUNT>] [transport <tcp|rdma|tcp,rdma>] <NEW-BRICK> ... \fR
\fB\ volume create <NEW-VOLNAME> [stripe <COUNT>] [replica <COUNT>] [disperse [<COUNT>]] [redundancy <COUNT>] [transport <tcp|rdma|tcp,rdma>] <NEW-BRICK> ... \fR
Create a new volume of the specified type using the specified bricks and transport type (the default transport type is tcp).
To create a volume with both transports (tcp and rdma), give 'transport tcp,rdma' as an option.
.TP