# How GlusterFS Distribution Works

The defining feature of any scale-out system is its ability to distribute work
or data among many servers. Accordingly, people in the distributed-system
community have developed many powerful techniques to perform such distribution,
but those techniques often remain little known or understood even among other
members of the file system and database communities that would benefit from
them. This confusion is reflected even in the name of the GlusterFS component
that performs distribution - DHT, which stands for Distributed Hash Table but
is not actually a DHT as that term is most commonly used or defined. The way
GlusterFS's DHT works is based on a few basic principles:

* All operations are driven by clients, which are all equal. There are no
  special nodes with special knowledge of where files are or should be.

* Directories exist on all subvolumes (bricks or lower-level aggregations of
  bricks); files exist on only one.

* Files are assigned to subvolumes based on *consistent hashing*, and even
  more specifically a form of consistent hashing exemplified by Amazon's
  [Dynamo][dynamo].

The result of all this is that users are presented with a set of files that is
the union of the files present on all subvolumes. The following sections
describe how this "uniting" process actually works.

## Layouts

The conceptual basis of Dynamo-style consistent hashing is a circle of
numbers, like a clock face. First, the circle is divided into segments and
those segments are assigned to bricks. (For the sake of simplicity we'll use
"bricks" hereafter even though they might actually be replicated/striped
subvolumes.) Several factors guide this assignment.

* Assignments are done separately for each directory.

* Historically, segments have all been the same size. However, this can lead
  to smaller bricks becoming full while plenty of space remains on larger
  ones. If the *cluster.weighted-rebalance* option is set, segment sizes
  will instead be proportional to brick sizes, as sketched below.

* Assignments need not include all bricks in the volume. If the
  *cluster.subvols-per-directory* option is set, only that many bricks will
  receive assignments for that directory.
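
To make the weighted case concrete, here is a minimal sketch (in Python, not
GlusterFS code) of segment sizes proportional to brick capacity. The brick
names and sizes are made up for illustration.

```python
# Minimal sketch: split the 32-bit hash ring into segments whose sizes are
# proportional to brick capacity (names and sizes here are hypothetical).
RING = 1 << 32
bricks = [("brick-a", 200), ("brick-b", 100), ("brick-c", 100)]  # name, GB

total = sum(size for _, size in bricks)
layout = []
start = 0
for i, (name, size) in enumerate(bricks):
    # The last brick absorbs rounding error so the segments cover the ring.
    end = RING - 1 if i == len(bricks) - 1 else start + RING * size // total - 1
    layout.append((name, start, end))
    start = end + 1

for name, lo, hi in layout:
    print(f"{name}: 0x{lo:08x} - 0x{hi:08x}")
```

With these sizes the output reproduces the weighted example shown later in
this section: brick-a gets half of the ring and the other two get a quarter
each.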

However these assignments are done, they collectively become what we call a
*layout* for a directory. This layout is then stored using extended
attributes, with each brick's copy of that extended attribute on that
directory consisting of four 32-bit fields.

* A version, which might be DHT\_HASH\_TYPE\_DM to represent an assignment as
  described above, or DHT\_HASH\_TYPE\_DM\_USER to represent an assignment
  made manually by the user (or an external script).

* A "commit hash", which will be described later.

* The first number in the assigned range (segment).

* The last number in the assigned range.

For example, the extended attributes representing a weighted assignment
between three bricks, one twice as big as the others, might look like this.

* Brick A (the large one): DHT\_HASH\_TYPE\_DM 1234 0x00000000 0x7fffffff

* Brick B: DHT\_HASH\_TYPE\_DM 1234 0x80000000 0xbfffffff

* Brick C: DHT\_HASH\_TYPE\_DM 1234 0xc0000000 0xffffffff
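
The sketch below decodes one of these extended-attribute values. The xattr
name (`trusted.glusterfs.dht`) and the field order (version, commit hash,
range start, range end) follow the description above, but both should be
verified against the GlusterFS version in use.

```python
# Decode a directory's DHT layout from its extended attribute.
# Field order and xattr name follow this document's description; verify
# them against your GlusterFS version before relying on them.
import os
import struct

def read_layout(path):
    raw = os.getxattr(path, "trusted.glusterfs.dht")
    version, commit_hash, start, end = struct.unpack(">4I", raw[:16])
    return version, commit_hash, start, end

# Example (run on a brick's copy of some directory):
# version, commit_hash, start, end = read_layout("/bricks/brick-b/mydir")
# print(f"range 0x{start:08x} - 0x{end:08x}")
```
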
## Placing Files

To place a file in a directory, we first need a layout for that directory - as
described above. Next, we calculate a hash for the file. To minimize
collisions either between files in the same directory with different names or
between files in different directories with the same name, this hash is
generated using both the (containing) directory's unique GFID and the file's
name. This hash is then matched to one of the layout assignments, to yield
what we call a *hashed location*. For example, consider the layout shown
above. The hash 0xabad1dea is between 0x80000000 and 0xbfffffff, so the
corresponding file's hashed location would be on Brick B. A second file with a
hash of 0xfaceb00c would be assigned to Brick C by the same reasoning.
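
The sketch below shows the matching step against the weighted layout from the
previous section. GlusterFS's real hash function is not reproduced here; a
truncated MD5 stands in for it, and the GFID value is made up, purely to show
how a 32-bit hash value selects a range.

```python
# Match a (parent GFID, filename) pair to a brick using the example layout.
# The hash below is a stand-in, NOT the hash GlusterFS actually uses.
import hashlib

LAYOUT = [
    ("Brick A", 0x00000000, 0x7fffffff),
    ("Brick B", 0x80000000, 0xbfffffff),
    ("Brick C", 0xc0000000, 0xffffffff),
]

def toy_hash(parent_gfid, name):
    # Mix the directory's GFID and the file name, then truncate to 32 bits
    # so the result lands somewhere on the ring.
    digest = hashlib.md5(parent_gfid.encode() + b"/" + name.encode()).digest()
    return int.from_bytes(digest[:4], "big")

def hashed_location(parent_gfid, name):
    h = toy_hash(parent_gfid, name)
    brick = next(b for b, lo, hi in LAYOUT if lo <= h <= hi)
    return brick, h

brick, h = hashed_location("00000000-0000-0000-0000-0000deadbeef", "report.txt")
print(f"hash 0x{h:08x} -> {brick}")
```
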
## Looking Up Files

Because layout assignments might change, especially as bricks are added or
removed, finding a file involves more than calculating its hashed location and
looking there. That is in fact the first step, and works most of the time -
i.e. the file is found where we expected it to be - but there are a few more
steps when that's not the case. Historically, the next step has been to look
for the file **everywhere** - i.e. to broadcast our lookup request to all
subvolumes. If the file isn't found that way, it doesn't exist. At this
point, an open that requires the file's presence will fail, or a create/mkdir
that requires its absence will be allowed to continue.

Regardless of whether a file is found at its hashed location or elsewhere, we
now know its *cached location*. As the name implies, this is stored within DHT
to satisfy future lookups. If it's not the same as the hashed location, we
also take an extra step. This step is the creation of a *linkfile*, which is a
special stub left at the **hashed** location pointing to the **cached**
location. Therefore, if a client naively looks for a file at its hashed
location and finds a linkfile instead, it can use that linkfile to look up the
file where it really is instead of needing to inquire everywhere.
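
The following sketch models that lookup sequence. Bricks are modeled as plain
dictionaries and a `("LINK", target)` value stands in for a linkfile; none of
these names are GlusterFS APIs, they exist only to illustrate the flow.

```python
def dht_lookup(name, hashed_brick, bricks, cache):
    """Return the brick that holds `name`, or None if it doesn't exist."""
    if name in cache:                        # cached location from an earlier lookup
        return cache[name]

    entry = bricks[hashed_brick].get(name)   # step 1: try the hashed location
    if isinstance(entry, tuple) and entry[0] == "LINK":
        hashed_brick = entry[1]              # step 2: follow the linkfile
        entry = bricks[hashed_brick].get(name)
    if entry is not None:
        cache[name] = hashed_brick
        return hashed_brick

    for brick in bricks:                     # step 3: broadcast lookup
        if bricks[brick].get(name) is not None:
            cache[name] = brick
            return brick
    return None                              # the file does not exist

bricks = {
    "A": {"notes.txt": "data"},
    "B": {"report.txt": ("LINK", "C")},      # linkfile pointing at the cached location
    "C": {"report.txt": "data"},
}
cache = {}
print(dht_lookup("report.txt", "B", bricks, cache))  # follows the linkfile -> C
print(dht_lookup("notes.txt", "C", bricks, cache))   # miss at hashed location -> broadcast -> A
```
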
## Rebalancing

As bricks are added or removed, or files are renamed, many files can end up
somewhere other than at their hashed locations. When this happens, the volumes
need to be rebalanced. This process consists of two parts.

1. Calculate new layouts, according to the current set of bricks (and possibly
   their characteristics). We call this the "fix-layout" phase.

2. Migrate any "misplaced" files to their correct (hashed) locations, and
   clean up any linkfiles which are no longer necessary. We call this the
   "migrate-data" phase.

Usually, these two phases are done together. (In fact, the code for them is
somewhat intermingled.) However, the migrate-data phase can involve a lot of
I/O and be very disruptive, so users can do just the fix-layout phase and defer
migrate-data until a more convenient time. This allows new files to be placed
on new bricks, even though old files might still be in the "wrong" place.

When calculating a new layout to replace an old one, DHT specifically tries to
maximize overlap of the assigned ranges, thus minimizing data movement. The
difference this makes can be very large. For example, consider the case where
our example layout from earlier is updated to add a new double-sized brick.
Here's a very inefficient way to do that.

* Brick A (the large one): 0x00000000 to 0x55555555

* Brick B: 0x55555556 to 0x7fffffff

* Brick C: 0x80000000 to 0xaaaaaaaa

* Brick D (the new one): 0xaaaaaaab to 0xffffffff

This would cause files in the following ranges to be migrated:

* 0x55555556 to 0x7fffffff (from A to B)

* 0x80000000 to 0xaaaaaaaa (from B to C)

* 0xaaaaaaab to 0xbfffffff (from B to D)

* 0xc0000000 to 0xffffffff (from C to D)

As a historical note, this is exactly what we used to do, and in this case it
would have meant moving two-thirds (8/12) of all files in the volume. Now
let's consider a new layout that's optimized to maximize overlap with the old
one.

* Brick A: 0x00000000 to 0x55555555

* Brick D: 0x55555556 to 0xaaaaaaaa <- optimized insertion point

* Brick B: 0xaaaaaaab to 0xd5555554

* Brick C: 0xd5555555 to 0xffffffff

In this case we only need to move 5/12 of all files. In a volume with millions
or even billions of files, reducing data movement by a quarter of all files is
a pretty big improvement. In the future, DHT might use "virtual node IDs" or
multiple hash rings to make rebalancing even more efficient.
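
A short sketch makes these fractions concrete. Assuming files are spread
uniformly over the 32-bit hash ring (so the fraction of files moved equals the
fraction of the ring that changes owner), it computes the movement caused by
each of the two candidate layouts above.

```python
RING = 1 << 32

old = [("A", 0x00000000, 0x7fffffff),
       ("B", 0x80000000, 0xbfffffff),
       ("C", 0xc0000000, 0xffffffff)]

naive = [("A", 0x00000000, 0x55555555),
         ("B", 0x55555556, 0x7fffffff),
         ("C", 0x80000000, 0xaaaaaaaa),
         ("D", 0xaaaaaaab, 0xffffffff)]

optimized = [("A", 0x00000000, 0x55555555),
             ("D", 0x55555556, 0xaaaaaaaa),
             ("B", 0xaaaaaaab, 0xd5555554),
             ("C", 0xd5555555, 0xffffffff)]

def owner(layout, point):
    return next(name for name, lo, hi in layout if lo <= point <= hi)

def moved_fraction(old_layout, new_layout):
    # Walk the ring in pieces bounded by every range start and add up the
    # pieces whose owner changes between the old and new layouts.
    edges = sorted({lo for _, lo, _ in old_layout + new_layout} | {RING})
    moved = sum(hi - lo for lo, hi in zip(edges, edges[1:])
                if owner(old_layout, lo) != owner(new_layout, lo))
    return moved / RING

print(f"naive:     {moved_fraction(old, naive):.3f}")      # ~0.667 = 8/12
print(f"optimized: {moved_fraction(old, optimized):.3f}")  # ~0.417 = 5/12
```
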
## Rename Optimizations

With the file-lookup mechanisms we already have in place, it's not necessary to
move a file from one brick to another when it's renamed - even across
directories. It will still be found, albeit a little less efficiently. The
first client to look for it after the rename will add a linkfile, which every
other client will follow from then on. Also, every client that has found the
file once will continue to find it based on its cached location, without any
network traffic at all. Because the extra lookup cost is small, and the
movement cost might be very large, DHT renames the file "in place" on its
current brick instead (taking advantage of the fact that directories exist
everywhere).

This optimization is further extended to handle cases where renames are very
common. For example, rsync and similar tools often use a "write new then
rename" idiom in which a file "xxx" is actually written as ".xxx.1234" and then
moved into place only after its contents have been fully written. To make this
process more efficient, DHT uses a regular expression to separate the permanent
part of a file's name (in this case "xxx") from what is likely to be a
temporary part (the leading "." and trailing ".1234"). That way, after the
file is renamed it will be in its correct hashed location - which it wouldn't
be otherwise if "xxx" and ".xxx.1234" hash differently - and no linkfiles or
broadcast lookups will be necessary.

In fact, there are two regular expressions available for this purpose -
*cluster.rsync-hash-regex* and *cluster.extra-hash-regex*. As its name
implies, *rsync-hash-regex* defaults to the pattern that rsync uses, while
*extra-hash-regex* can be set by the user to support a second tool using the
same temporary-file idiom.
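
As an illustration, the sketch below strips the leading "." and trailing
".suffix" so that a temporary name hashes the same as its final name. The
pattern shown is only representative; check the actual default of
*cluster.rsync-hash-regex* for your GlusterFS version.

```python
# Hash the "permanent" part of an rsync-style temporary name.
# The pattern is representative, not necessarily the shipped default.
import re

TEMP_NAME = re.compile(r"^\.(.+)\.[^.]+$")

def name_for_hashing(filename):
    m = TEMP_NAME.match(filename)
    return m.group(1) if m else filename

print(name_for_hashing(".xxx.1234"))  # -> "xxx", hashes like the final name
print(name_for_hashing("xxx"))        # -> "xxx"
```
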
## Commit Hashes

A very recent addition to DHT's algorithmic arsenal is intended to reduce the
number of "broadcast" lookups that it issues. If a volume is completely in
balance, then no file could exist anywhere but at its hashed location.
Therefore, if we've already looked there and not found it, then looking
elsewhere would be pointless (and wasteful). The *commit hash* mechanism is
used to detect this case. A commit hash is assigned to a volume, and
separately to each directory, and then updated according to the following
rules.

* The volume commit hash is changed whenever actions are taken that might
  cause layout assignments across all directories to become invalid - i.e.
  bricks being added, removed, or replaced.

* The directory commit hash is changed whenever actions are taken that might
  cause files to be "misplaced" - e.g. when they're renamed.

* The directory commit hash is set to the volume commit hash when the
  directory is created, and whenever the directory is fully rebalanced so that
  all files are at their hashed locations.

In other words, changing either the volume or the directory commit hash
creates a mismatch between the two. In that case we revert to the
"pessimistic" broadcast-lookup method described earlier. However, if the two
hashes match, then we can skip the broadcast lookup and return a result
immediately. This has been observed to yield a 3x performance improvement in
workloads that involve creating many small files across many bricks.
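
The decision itself is tiny; here is a sketch (illustrative names only, not
actual DHT structures) of what happens when a lookup misses at the hashed
location.

```python
def after_miss_at_hashed_location(volume_commit_hash, dir_commit_hash):
    """Called when a lookup at the hashed location found nothing."""
    if dir_commit_hash == volume_commit_hash:
        # The directory is known to be fully balanced, so the file simply
        # doesn't exist: skip the broadcast lookup entirely.
        return "ENOENT"
    # The hashes differ, so files may be misplaced: fall back to asking
    # every subvolume.
    return "broadcast lookup"

print(after_miss_at_hashed_location(42, 42))  # -> ENOENT
print(after_miss_at_hashed_location(42, 7))   # -> broadcast lookup
```
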
[dynamo]: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf