1081 Commits

Author SHA1 Message Date
Pranith Kumar K
6b393b3ab6 cluster/afr: detect in-progress creation in lookup and return ENOENT
if any subvol returned ENOENT while parent entrylk lock was held,
yield and return ENOENT for the entire lookup.

This is how the issue happens:

Multiple clients A, B and C are attempting 'mkdir -p /mnt/a/b/c'

1 Client A is in the middle of mkdir(/a). It has acquired lock.
  It has performed mkdir(/a) on one subvol, and second one is still
  in progress
2 Client B performs a lookup, sees directory /a on one,
  ENOENT on the other, succeeds lookup.
3 Client B performs lookup on /a/b on both subvols, both return ENOENT
  (one subvol because /a/b does not exist, another because /a
  itself does not exist)
4 Client B proceeds to mkdir /a/b. It obtains entrylk on inode=/a with
  basename=b on one subvol, but fails on other subvol as /a is yet to
  be created by Client A.
5 Client A finishes mkdir of /a on other subvol
6 Client C also attempts to create /a/b, lookup returns ENOENT on
  both subvols.
7 Client C tries to obtain entrylk on on inode=/a with basename=b,
  obtains on one subvol (where B had failed), and waits for B to unlock
  on other subvol.
8 Client B finishes mkdir() on one subvol with GFID-1 and completes
  transaction and unlocks
9 Client C gets the lock on the second subvol, At this stage second
  subvol already has /a/b created from Client B, but Client C does not
  check that in the middle of mkdir transaction
10 Client C attempts mkdir /a/b on both subvols. It succeeds on
   ONLY ONE (where Client B could not get lock because of
   missing parent /a dir) with GFID-2, and gets EEXIST from ONE subvol.
This way we have /a/b in GFID mismatch. One subvol got GFID-1 because
Client B performed transaction on only one subvol (because entrylk()
could not be obtained on second subvol because of missing parent dir --
caused by premature/speculative succeeding of lookup() on /a when locks
are detected). Other subvol gets GFID-2 from Client C because while
it was waiting for entrylk() on both subvols, Client B was in the
middle of creating mkdir() on only one subvol, and Client C does not
"expect" this when it is between lock() and pre-op()/op() phase of the
transaction.

Original-author: Anand Avati <avati@redhat.com>
Change-Id: Idca475dbbc2a51e09da6fa0f9e1e37148caef208
BUG: 860210
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4625
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-04-02 18:59:21 -07:00
Venky Shankar
af939370ad cluster/afr: sync xattrs removed on source to sink(s)
xattrs are first removed from sink followed by setting
source xattrs.

Change-Id: I181cb5b785b667bbfc6e40787a2183a8f45de06b
BUG: 906646
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Reviewed-on: http://review.gluster.org/4656
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
2013-04-02 10:29:38 -07:00
Pranith Kumar K
864ac6b7b3 cluster/afr: prevent piggyback on stale pre_op
Here are the logs of a file on which we saw EIO because of size mismatch:
[root@lizzie ~]# grep 38f18204 /var/log/glusterfs/mnt-x-.log
Reporting Unstable write for 38f18204-2840-408e-ae65-c01f4106b8c4
for offset: 0, len: 7680

Cleared unstable write flag for 38f18204-2840-408e-ae65-c01f4106b8c4:
offset 0 length 7680

Reporting Unstable write for 38f18204-2840-408e-ae65-c01f4106b8c4 for
offset: 7680, len: 71680

Reporting Unstable write for 38f18204-2840-408e-ae65-c01f4106b8c4 for
offset: 79360, len: 15716

fsync completed on 38f18204-2840-408e-ae65-c01f4106b8c4 for
offset 0 length 7680 with changelog status: -1 -1

According to these logs fsync did not happen after writev with
offset: 79360, len: 15716. Which is the reason for this problem.

In total 3 writes came. lets call them w1, w2, w3
w1 does pre_op so pre_op_done[0], pre_op_done[1] counts become 1 and 1
then is_piggyback_post_op() is called for w1 and it returns *false*

w1's fsync is fired

Now w2 and w3 come and see that pre_op_done[0], pre_op_done[1] are both 1,
so pre_op_piggyback[0] and pre_op_piggyback[1] are both incremented twice,
once by w2, one more time by w3 and become 2, 2  ------- Step-A

Now fsync of w1 is complete and it goes ahead with post op and decrements
pre_op_done[0], pre_op_done[1] to 0, 0

Now w2, w3 writevs complete and is_piggyback_post_op will return *true* for
both w2, w3.
So fsync is not fired for both w2, w3

this patch prevents Step-A from happening.

Change-Id: I8b6af1f1875b2cf5f718caa3c16ee7ff3dc96b5c
BUG: 927146
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4752
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
2013-04-02 09:02:35 -07:00
Anand Avati
e0616e9314 dht: improve transform/detransform of d_off (and be ext4 safe)
The scheme to encode brick d_off and brick id into global d_off has
two approaches. Since both brick d_off and global d_off are both 64-bit
wide, we need to be careful about how the brick id is encoded.

Filesystems like XFS always give a d_off which fits within 32bits. So
we have another 32bits (actually 31, in this scheme, as seen ahead) to
encode the brick id - which is typically plenty.

Filesystems like the recent EXT4 utilize the upto 63 low bits in d_off,
as the d_off is calculated based on a hash function value. This leaves
us no "unused" bits to encode the brick id.

However both these filesystmes (EXT4 more importantly) are "tolerant" in
terms of the accuracy of the value presented back in seekdir(). i.e, a
seekdir(val) actually seeks to the entry which has the "closest" true
offset.

This "two-prong" scheme exploits this behavior - which seems to be the
best middle ground amongst various approaches and has all the advantages
of the old approach:

- Works against XFS and EXT4, the two most common filesystems out there.
  (which wasn't an "advantage" of the old approach as it is borken against
   EXT4)

- Probably works against most of the others as well. The ones which would
  NOT work are those which return HUGE d_offs _and_ NOT tolerant to
  seekdir() to "closest" true offset.

- Nothing to "remember in memory" or evict "old entries".

- Works fine across NFS server reboots and also NFS head failover.

- Tolerant to seekdir() to arbitrary locations.

Algorithm:

Each d_off can be encoded in either of the two schemes. There is no
requirement to encode all d_offs of a directory or a reply-set in
the same scheme.

The topmost bit of the 64 bits is used to specify the "type" of encoding
of this particular d_off. If the topmost bit (bit-63) is 1, it indicates
that the encoding scheme holds a HUGE d_off. If the topmost bit is is 0,
it indicates that the "small" d_off encoding scheme is used.

The goal of the "small" d_off encoding is to stay as dense as possible
towards the lower bits even in the global d_off.

The goal of the HUGE d_off encoding is to stay as accurate (close) as
possible to the "true" d_off after a round of encoding and decoding.

If DHT has N subvolumes, we need ROOF(Log2(N)) "bits" to encode the brick
ID (call it "n").

SMALL d_off
===========

Encoding
--------
    If the top n + 1 bits are free in a brick offset, then we leave the
top bit as 0 and set the remaining bits based on the old formula:

   hi_mask = 0xffffffffffffffff

   hi_mask = ~(hi_mask >> (n + 1))

   if ((hi_mask & d_off_brick) != 0)
       do_large_d_off_encoding ()

   d_off_global = (d_off_brick * N) + brick_id

Decoding
--------
    If the top bit in the global offset is 0, it indicates that this
is the encoding formula used. So decoding such a global offset will
be like the old formula:

   if ((d_off_global & 0x8000000000000000) != 0)
      do_large_d_off_decoding()

   d_off_brick = (d_off_global % N)

   brick_id = d_off_global / N

HUGE d_off
==========

Encoding
--------
   If the top n + 1 bits are NOT free in a given brick offset, then we
set the top bit as 1 in the global offset. The low n bits are replaced
by brick_id.

    low_mask = 0xffffffffffffffff << n   // where n is ROOF(Log2(N))

    d_off_global = (0x8000000000000000 | d_off_brick & low_mask) + brick_id

    if (d_off_global == 0xffffffffffffffff)
        discard_entry();

Decoding
--------
    If the top bit in the global offset is set 1, it indicates that
the encoding formula used is above. So decoding would look like:

    hi_mask = (0xffffffffffffffff << n)
    low_mask = ~(hi_mask)

    d_off_brick = (global_d_off & hi_mask & 0x7fffffffffffffff)

    brick_id = global_d_off & low_mask

    If "losing" the low n bits in this decoding of d_off_brick looks
"scary", we need to realize that till recently EXT4 used to only
return what can now be expressed as (d_off_global >> 32). The extra
31 bits of hash added by EXT recently, only decreases the probability
of a collision, and not eliminate it completely, anyways. In a way,
the "lost" n bits are made up by decreasing the probability of
collision by sharding the files into N bricks / EXT directories
    -- call it "hash hedging", if you will :-)

Thanks-to: Zach Brown <zab@redhat.com>
Change-Id: Ieba9a7071829d51860b7c131982f12e0136b9855
BUG: 838784
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4711
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
2013-04-01 13:13:01 -07:00
Anand Avati
0b81f2801b cluster/afr: fix fd leak with unsafe call_resume()
Introduce AFR_CALL_RESUME macro which cleans up frame->local, like
how AFR_STACK_UNWIND etc. do.

Therefore fix leak in afr_fsync() path.

Change-Id: I3855d8e7e84dbc44e05f507563b7f722bf9621b8
BUG: 927146
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4745
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
2013-03-28 19:50:57 -07:00
Pranith Kumar K
6ae6f3db02 cluster/afr: fsync before erase xattrs in data self-heal
Added extra fsync to data self-heal code to make sure the
data reached disk before erasing the changelogs

Change-Id: I9e7e6e55cdc49de2b991705d1638946464a9d4f9
BUG: 927146
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4744
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-03-28 16:25:09 -07:00
Pranith Kumar K
ca6a3d1e39 cluster/afr: piggyback and fsync resume changes
1) pre_op_piggyback should always be decremented.
2) Move fsync resume to just after post_op.
3) fsync stub should be created from afr's local
   not from the final response.

Change-Id: I220bb532eb03bea584292f4dd2e816ad0c3e0cf7
BUG: 927146
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4741
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-03-28 08:49:55 -07:00
Anand Avati
8909c28c11 cluster/afr: fsync() guarantees POST-OP completion
AFR now provides a stronger guarantee that fsync() returns only
after completely finishing all the deferred/delayed POST-OP on that
open file.

To acheive this we make a stub out of the returning fsync and
register it with the "delayed" frame in afr_changelog_wake_resume().

The delayed frame, after getting woken up and finishing the POST-OP
will call_resume() the registered stub (which UNWINDs the fsync) at
the time of frame destruction.

This provides a guarantee that an application's (or FUSE) fsync()
returns only after finishing up all the previous transactions,
including delayed POST-OPs and UNLOCK.

Change-Id: Iaa955457e2f25088a144fde37ad0444277b5cf49
BUG: 927146
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4737
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
2013-03-27 22:16:12 -07:00
Anand Avati
ca10fdc81a cluster/afr: ensure DATA operations are made durable before POST-OP
The changelogging scheme of AFR stores information about the state
of all replicas in all replicas (in the extended attribute of the
respective files on each server) in the form of 'pending counts'
of operations (effectively "dirty flags"). These xattrs are blindly
trusted while performing self-heal, and therefore utmost care has
to be taken while updating and maintaing them.

The most critical updation is the clearing of the pending counts
corresponding to the *other* server in the changelog of a given
server. Before clearing the pending count, we need durability
guarantee of the write which was performed on the other server.

To obtain such a guarantee, it may be necessary to explicitly
introduce an fsync() phase (if the file itself wasn't already
opened with O_SYNC).

This patch introduces the detection of unstable stable writes on
a file and issues explicit fsync() on the servers before performing
the POST-OP clearing of pending flags.

Change-Id: I2171b86a74ec91e40e5877eef0a4e7379578ecf7
BUG: 927146
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4721
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
2013-03-27 22:15:53 -07:00
Jeff Darcy
76bc5d1b2d dht: make DHT xattr names configurable
This is necessary to support "DHT over DHT" configurations, so that the
upper and lower instances of DHT don't step all over each other.  Why
would we even consider such a thing?  Because it gives us the ability to
do data tiering and rack-aware placement, either by themselves or as
complements to other functionality such as erasure codes or
deduplication which save space but cost performance.  By setting up the
top-level DHT to place data into one of several lower-level DHT pools
based on policy instead of pure elastic hashing, we get better
performance for 90% of accesses and better storage efficiency for 90% of
data, all for relatively low effort.

Change-Id: I72e65c29edfc80babf39f7a2a00090f4588c4070
BUG: 924265
Signed-off-by: Jeff Darcy <jdarcy@redhat.com>
Reviewed-on: http://review.gluster.org/4694
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-03-21 15:04:31 -07:00
Pranith Kumar K
f325551e4c nfs, afr: Fail lookup only on split-brain
Change-Id: Icee9772f1f1bf5336eb82a4dc13e198424cd4a65
BUG: 921996
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4699
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
2013-03-20 19:51:05 -07:00
JulesWang
1975c54164 dht: fix a typo
Change-Id: Id6f156957e58aad06bf2602f880c7e4102b80fd1
BUG: 764890
Signed-off-by: JulesWang <w.jq0722@gmail.com>
Reviewed-on: http://review.gluster.org/4679
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
2013-03-18 11:34:24 -07:00
Pranith Kumar K
35660032d6 cluster/afr: Preserve mtime in self-heal
Problem:
Data self-heal may choose sink iatt to set mtimes.
This happens because after syncing of data is done
self-heal does one more xattrops/fstat to determine
sources sinks to set the inode-ctx. Since this is done
after data syncing and erase of xattrs, old source and
old sink are now sources, but the mtimes of them differ.
Old code just takes the first source from the list and
update mtimes, which could be sink before the self-heal
started.

Fix:
Set mtime from 'sources before syncing'.

Change-Id: Id769e1b99aa4f041eaee775f64cbf2c57b799723
BUG: 918437
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4658
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
2013-03-12 08:08:52 -07:00
shishir gowda
140e9756a5 cluster/distribute: Fix layout overlaps due to spread-count in selfheal path
We needed to zero out the layout range, before we re-calculate the range.
When spread-count is issued, we would end up with stale ranges in the layout.

Replaced dht_selfheal_dir_xattr with dht_fix_dir_xattr, which correctly resets
the un-used (after re-cal) layouts.

Change-Id: I1a900d15df07335f59356bd23182ccec34381ab2
BUG: 884455
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4647
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
2013-03-07 19:51:22 -08:00
Csaba Henk
8203d34f50 cluster xlators: s/-1/GF_CLIENT_PID_GSYNCD/
Change-Id: I03be3cb23684de4ab36cf2953002708466edd580
BUG: 765433
Signed-off-by: Csaba Henk <csaba@redhat.com>
Reviewed-on: http://review.gluster.org/4601
Reviewed-by: Venky Shankar <vshankar@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
2013-03-03 20:35:48 -08:00
Pranith Kumar K
9dac72481b cluster/afr: Turn on eager-lock for fd DATA transactions
Problem:
With the present implementation, eager-lock is issued for
any fd fop. eager-lock is being transferred to metadata
transactions. But the lk-owner is set to local->fd address
only for DATA transactions, but for METADATA transactions
it is frame->root. Because of this unlock on the eager-lock fails
and rebalance hangs.

Fix:
Enable eager-lock for fd DATA transactions

Change-Id: If30df7486a0b2f5e4150d3259d1261f81473ce8a
BUG: 916226
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4588
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-03-01 14:50:58 -08:00
Avra Sengupta
549231dda9 glusterd: Added the validation function for subvols-per-directory
Change-Id: Ie2259023b9001311a2032792639c3093054f6750
BUG: 896431
Signed-off-by: Avra Sengupta <asengupt@redhat.com>
Reviewed-on: http://review.gluster.org/4552
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
2013-02-28 11:31:28 -08:00
shishir gowda
da3f6576e4 cluster/distribute: Prevent spurious multiple defrag crawls
In dht_notify, we used to create a thread to start defrag
crawls after we had heard from all child subvols.
This was in-correct, as a later event, could also trigger the
crawl again(due to the fact that all subvols had responded).

The fix is to make sure, the thread is started only once after
all subvols have responded the first time

Change-Id: Ifc2978b9dc866af2395b79911eca50ab38ff9457
BUG: 916449
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4587
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
2013-02-28 10:32:06 -08:00
Anand Avati
202bcaee24 cluster/dht: print hash function munging logs in DEBUG mode
Change-Id: Ia2e6bce80710d103da9d78afdb389ea162b00686
BUG: 912564
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4590
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
2013-02-27 14:55:15 -08:00
shishir gowda
dafd31b718 cluster/distribute: Add filter to support file patterns to be migrated
'gluster volume rebalance' command will be enhanced to support passing of these
options/pattern.

<pattern> is comma separated list as show below. The Precedence is from right
to left.

e.g- "*avi,*pdf:10MB,*:1KB"
     The precedence is as follows:
     migrate all files with size equal or greater than 1KB "*:1KB"
     migrate all pdf files with size equal or greater than 10MB "*pdf:10MB"
     migrate all avi files "*avi"

With this option, it is possible to choose which files to migrate.

Change-Id: I6d6d6a015bcbacf1debae2f278a2d92306fb055d
BUG: 896456
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4366
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-02-27 12:16:07 -08:00
shishir gowda
164c9586ae cluster/distribute: Preserve file size during rebalance migration
If holes are encountered, then we do not write these to the dst,
which sometimes causes file size to be lesser than src. Data is not
corrupted, as when non-zero reads are received, we do write that data.

Calling a truncrate to give file size to prevent it from being
truncated to less than src in case the file end has holes.

Thanks to Brian Foster for providing the test case

Change-Id: I3cdd143b63ec8d797273d76189dff8b05eb9e551
BUG: 915554
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4574
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-02-25 23:01:28 -08:00
Pranith Kumar K
15d4ae7da9 cluster/afr: Don't queue transactions during open-fd fix
Before Anonymous fds are available, afr had to queue up
transactions if the file is not opened on one of its
subvolumes. This happens until the attempt to open the
file either succeeds or fails. These attempts happen
until the file is successfully opened on the subvolume.
Now client xlator uses anonymous fds to perform the fops
if the fd used for the fop is not 'opened'.
Fops will be successful even when the file is not opened
so there is no need to queue up the transactions anymore in afr.
Open is attempted on the subvolume where it is not
opened independent of the fop.

Change-Id: Id1a4b4ebe6f89f9efe8f6a8247918b91247d0819
BUG: 913051
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4568
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-02-22 21:03:03 -08:00
Pranith Kumar K
12d865dd59 dht: Enable mem-accounting for nufa
Change-Id: I0cf8a59c81e30603aadc45393bbd49d80ae1621f
BUG: 912657
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4542
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-02-19 18:55:43 -08:00
Raghavendra Bhat
b371736a58 cluster/afr: do complete split-brain check in all the fd based fops
fd based operations such as readv checked only for data split brain
instead of complete split-brain (i.e both data + metadata) assuming that
open would have done the complete split-brain check. However open-behind
would have unwound open, without winding to afr thus preventing the complete
split-brain check and some appliations will be able to read the contents
of the file even though the file has metadata split-brain. So let all
the fd based fops do a defensive check of complete split-brain.

Change-Id: Ia90b35f2b08426dfcad804b7f8105278c86fbd2d
BUG: 846240
Signed-off-by: Raghavendra Bhat <raghavendra@redhat.com>
Reviewed-on: http://review.gluster.org/4548
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-02-19 16:06:28 -08:00
Jeff Darcy
59ac567c8b distribute: add hash-name-regex option
This is to support the common "write to temp file then rename" idiom. In our
case this causes us to create a linkfile during the rename in (N-1)/N cases,
with a significant impact on performance e.g. for UFO which uses this idiom.
This patch allows the user to specify up to two regular expressions that
separate the permanent and transient parts of a temp-file name:

    rsync-hash-regex reimplements the existing RSYNC_FRIENDLY_NAME
    pattern where the temporary name for EXAMPLE is .EXAMPLE.suffix

    extra-hash-regex can be used for a second pattern, active concurrently,
    and can include alternatives via the POSIX extended-regex syntax

Change-Id: Ic1a6ed19324bc31fefe91ee34b8478877a9c5d2c
BUG: 912564
Signed-off-by: Jeff Darcy <jdarcy@redhat.com>
Reviewed-on: http://review.gluster.org/4116
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-02-18 20:50:48 -08:00
Amar Tumballi
1d172d6ee1 cluster/dht: improvement in rebalance logs
provide space between path and next string, so that we can grep
for the correct path in error logs.

Change-Id: Ide53e341864dce432406a58e8cbb2ff1480cfbde
Signed-off-by: Amar Tumballi <amarts@redhat.com>
BUG: 815194
Reviewed-on: http://review.gluster.org/4531
Reviewed-by: Anand Avati <avati@redhat.com>
Tested-by: Anand Avati <avati@redhat.com>
2013-02-17 22:00:43 -08:00
shishir gowda
c87472e200 cluster/distribute: Reopen fds in migration internally as root:root
Though linkfile_create and rebalance dst file create sent a setattr
with correct ownership, there is still a race window where the linkfile
open (client open due to migration) will fail, as its ownership will be
root:root.

Change-Id: I056092da6102319efa3bb9f1795f8c97db2a3fed
BUG: 884597
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4513
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-02-14 20:26:58 -08:00
shishir gowda
5d29e59866 cluster/distribute: Remove suprious fd_unref call
After fix http://review.gluster.org/4282 (libglusterfsterfs/syncop: do not
hold ref on the fd in cbk) was pushed, syncop_open does not take a ref anymore.

Change-Id: I5543986a4f2d965eef42ca979b39ac75439cec49
BUG: 910661
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4509
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-02-14 20:26:16 -08:00
shishir gowda
a42490385d cluster/dht: Create linkfile with file uid/gid
Currently, linkfile creation happens as root.

use uid/gid returned from _cbk (link/rename) to set the correct ownership of
the link files.

Also added test/dht.rc to implement common dht functions

Change-Id: Iecdcf52d89fed8286670ce430b9d19deebd27b8c
BUG: 884597
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4304
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-02-13 16:17:44 -08:00
Venky Shankar
19de18219b cluster/dht: pathinfo xattr changes for directories
Since directories have presence on all subvolumes there is
no definite meaning of ->hashed_subvol or ->cached_subvol.
getxattr() code path chooses ->cached_subvol for pathinfo
extended attribute. While this makes sense of files, it makes
less sense for directories. Further if a hashed or a cached
subvolume is down, and there's a getxattr request for a
directory, we return with an errno.

This patch changes pathinfo extended attribute contents by
aggregating information from all subvolumes that are up.

Change-Id: I58adb741d63ccfd1d0239af75eb65f26f0fb384d
Signed-off-by: Venky Shankar <vshankar@redhat.com>
BUG: 856455
Reviewed-on: http://review.gluster.org/4047
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-02-08 19:09:46 -08:00
Anand Avati
d3e7881ecd Use proper libtool option -avoid-version instead of bogus -avoidversion
Change-Id: I1c9541058c7d07786539a3266ca125a6a15287d8
BUG: 859835
Signed-off-by: Anand Avati <avati@redhat.com>
Original-author: Kacper Kowalik (Xarthisius) <xarthisius.kk@gmail.com>
Signed-off-by: Kacper Kowalik (Xarthisius) <xarthisius.kk@gmail.com>
Reviewed-on: http://review.gluster.org/3967
Tested-by: Gluster Build System <jenkins@build.gluster.com>
2013-02-07 15:12:56 -08:00
Anand Avati
dc2da4a3d9 afr: serialize modification of {entrylk,inodelk}_lock_count
Typically this lock was not needed in practice, but with
http://review.gluster.org/3842, this code gets executed in multiple
threads for different servers and we lose a count. This results in
leaked lock and a hang for a future transaction.

Change-Id: I377ed20e44f2a45cff522289dfef181f0653eca2
BUG: 765564
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4480
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
2013-02-07 11:09:35 -08:00
Jeff Darcy
da9d54cac6 dht: better layout-optimization algorithm
This method deals with the case where swapping might gain a bigger overlap
for the xlator currently under consideration, but sacrifices even more from
the xlator we're swapping with. For example:

A = 0x00000000 - 0x44444443 (new 0x00000000 - 0x55555554)
B = 0x44444444 - 0x77777776 (new 0x55555555 - 0xaaaaaaa9)
C = 0x77777777 - 0xffffffff (new 0xaaaaaaaa - 0xffffffff)

Here, the new range for B has a bigger overlap with the old C than with the
old B (0x33333333 vs. 0x22222222 to be precise) so looking only at that
might lead us to swap. However, such a swap turns the new C's overlap from
0x55555556 (vs. old C) to *zero* (vs. old B).  In other words, we've gained
0x11111111 for B but lost 0x55555556 for C, so it's a bad idea.

The new algorithm accounts for all effects of the swap, so it not only avoids
bad swaps but can make some good ones that would have been missed previously.
For example, if swapping a range X with a later range Y would not increase the
overlap for X we would previously have skipped it even if the swap would
increase Y's overlap without affecting X's.  This is the normal case when we're
adding a new brick (which initially has zero overlap with any old range) so
finding more good swaps is probably even more important than avoiding bad ones.

Also, the logic in dht_overlap_calc was completely broken before, causing
integer overflows instead of providing correct values, so no matter what
higher-level algorithm was in place the GIGO effect would have resulted in
bad decisions.

Change-Id: If61ed513cfcb931916c6b51da293e3efbaaf385f
BUG: 853258
Signed-off-by: Jeff Darcy <jdarcy@redhat.com>
Reviewed-on: http://review.gluster.org/3908
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-02-07 08:27:40 -08:00
Pranith Kumar K
3a141cda38 cluster/afr: Avoid priv->eager_lock value update race
Change-Id: I7049c0c64e36a9dfa4cc0e0b34de7ec111d2f6c1
BUG: 908302
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4076
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-02-06 09:25:42 -08:00
Pranith Kumar K
e0a331c4be cluster/afr: Perform wakeup just before fop
There is no necessity for the delayed-post-op to wait until
the next fop phase on the fd completes. Change-log,
locks are inherited by the time next fop phase is attempted
so the wakeup can happen just before the fop phase is started.

Change-Id: I0b8e591f591b0f7565eb55265ab51f476ed2b165
BUG: 908302
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4073
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-02-06 09:24:31 -08:00
Raghavendra Talur
2a46c8769b cluster/dht: Correct min_free_disk behaviour
Problem:
Files were being created in subvol which had less than
min_free_disk available even in the cases where other
subvols with more space were available.

Solution:
Changed the logic to look for subvol which has more
space available.
In cases where all the subvols have lesser than
Min_free_disk available , the one with max space and
atleast one inode is available.

Known Issue: Cannot ensure that first file that is
created right after min-free-value is crossed on a
brick will get created in other brick because disk
usage stat takes some time to update in glusterprocess.
Will fix that as part of another bug.

Change-Id: If3ae0bf5a44f8739ce35b3ee3f191009ddd44455
BUG: 858488
Signed-off-by: Raghavendra Talur <rtalur@redhat.com>
Reviewed-on: http://review.gluster.org/4420
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-02-04 08:43:50 -08:00
Varun Shastry
50f0882051 cluster/stripe: Mount issues with Stripe xlator
Problem:
* 'CONNECTING' is taken as CHILD_UP.
* Sending notifications (default_notify()) for all the events individually
   while mounting.

Solution:
* Consider Child up only after the event CHILD_UP is received.
* Send a single notification for all the children's events only
   while mounting.

Change-Id: I1b7de127e12f5bfb8f80702dbdce02019e138bc8
BUG: 885072
Signed-off-by: Varun Shastry <vshastry@redhat.com>
Reviewed-on: http://review.gluster.org/4356
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-02-03 15:29:59 -08:00
Anand Avati
80d08f13b0 cluster/dht: ignore EEXIST error in mkdir to avoid GFID mismatch
In dht_mkdir_cbk, EEXIST error is treated like a true error. Because
of this the following sequence of events can happen, eventually
resulting in GFID mismatch and (and possibly leaked locks and hang,
in the presence of replicate.)

The issue exists when many clients concurrently attempt creation of
directory and subdirectory (e.g mkdir -p /mnt/gluster/dir1/subdir)

0. First mkdir happens by one client on the hashed subvolume. Only
   one client wins the race. Others racing mkdirs get EEXIST. Yet
   other "laggers" in the race encounter the just-created directory
   in lookup() on the hash dir.

1. At least one "lagger" lookup() notices that there are missing
   directories on other subvolumes (which the "winner" mkdir is yet
   to create), and starts off self-heal of the directory.

2. At least on some subvolumes, self-heal's mkdir wins the race
   against the "winner" mkdir and creates the directory first. This
   causes the "winner" mkdir to experience EEXIST error on those
   subvolumes.

3. On other subvolumes where "winner" mkdir won the race, self-heal
   experiences EEXIST error, but self-heal is properly translating
   that into a success (but mkdir code path is not -- which is the
   bug.)

4. Both mkdir and self-heal assign hash layouts to the just created
   directory. But self-heal distributes hash range across N (total)
   subvolumes, whereas mkdir distributes hash range across N - M
   (where M is the number of subvolumes where mkdir lost the race).
   Both the clients "cache" their respective layouts in the near
   future for all future creates inside them (evidence in logs)

5. During the creation of the subdirectory, two clients race again.
   Ideally winner performs mkdir() on the hashed subvolume and proceeds
   to create other dirs, loser experiences EEXIST error on the hashed
   subvolume and backs off. But in this case, because the two clients
   have different layout views of the parent directory (because of
   different hash splits and assignements), the hashed subvolumes for
   the new directory can end up being different. Therefore, both clients
   now win the race (they were never fighting against each other on a
   common server), assigning different GFIDs to the directory on their
   respective (different) subvolumes. Some of the remaining subvolumes
   get GFID1, others GFID2.

Conclusion/Fix:
   Making mkdir translate EEXIST error as success (just the way self-heal
   is already rightly doing) will bring back truth to the design claim
   that concurrent mkdir/self-heals perform deterministic + idempotent
   operations. This will prevent the differing "hash views" by different
   clients and thereby also avoid GFID mismatch by forcing all clients
   to have a "fair race", because the hashed subvolume for all will be
   the same (and thereby avoiding leaked locks and hangs.)

Change-Id: I84592fb9b8a3f739a07e2afb23b33758a0a9a157
BUG: 907072
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4459
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
2013-02-03 12:14:19 -08:00
Venkatesh Somyajula
454c6c0fde cluster/afr: added logging of changelog for split-brain in glustershd.log file
Change-Id: Iaf119f839cb2113b8f8efb7bf7636d471b6541bf
BUG: 866440
Signed-off-by: Venkatesh Somyajula <vsomyaju@redhat.com>
Reviewed-on: http://review.gluster.org/4385
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
2013-02-03 11:48:01 -08:00
Varun Shastry
315ee9c4e0 cluster/dht: stack wind with cookie
Default_fops uses stack_wind_tail. It winds without creating the frame leading
into wrong subvol return in the cookie. To avoid the problem caused by the
same, we're getting the subvol by passing the cookie.

Change-Id: I51ee79b22c89e4fb0b89e9a0bc3ac96c5b469f8f
BUG: 893338
Signed-off-by: Varun Shastry <vshastry@redhat.com>
Reviewed-on: http://review.gluster.org/4388
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
Tested-by: Anand Avati <avati@redhat.com>
2013-01-31 17:18:03 -08:00
Raghavendra Bhat
e979c0de9d libglusterfs/syncop: do not hold ref on the fd in cbk
* Do not do fd_ref in cbks of the fops which return a fd (such as
  open, opendir, create).

Change-Id: Ic2f5b234c5c09c258494f4fb5d600a64813823ad
BUG: 885008
Signed-off-by: Raghavendra Bhat <raghavendra@redhat.com>
Reviewed-on: http://review.gluster.org/4282
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-01-30 23:40:37 -08:00
Raghavendra Bhat
326a47939d cluster/afr: if a subvolume is down wind the lock request to next
When one of the subvolume is down, then lock request is not attempted
on that subvolume and move on to the next subvolume.

/* skip over children that are down */
                while ((child_index < priv->child_count)
                       && !local->child_up[child_index])
                        child_index++;

In the above case if there are 2 subvolumes and 2nd subvolume is down (subvolume
1 from afr's view), then after attempting lock on 1st child (i.e subvolume 0)
child index is calculated to be 1. But since the 2nd child is down child_index
is incremented to 2 as per the above logic and lock request is STACK_WINDed to
the child with child_index 2. Since there are only 2 children for afr the child
(i.e the xlator_t pointer) for child_index will be NULL. The process crashes
when it dereference the NULL xlator object.

Change-Id: Icd9b5ad28bac1b805e6e80d53c12d296526bedf5
BUG: 765564
Signed-off-by: Raghavendra Bhat <raghavendra@redhat.com>
Reviewed-on: http://review.gluster.org/4438
Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-01-29 12:50:55 -08:00
Pranith Kumar K
360868e9b0 cluster/afr: wakeup delayed post op on fsync
Change-Id: I5d84ef72615f9d71b4af210976e2449de6e02326
BUG: 888174
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4446
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-01-29 12:33:05 -08:00
Pranith Kumar K
36a9ac82f0 cluster/afr: Change order of unwind, resume for writev
Generally inode-write fops do transaction.unwind then
transaction.resume, but writev needs to make sure that
delayed post-op frame is placed in fdctx before unwind
happens. This prevents the race of flush doing the
changelog wakeup first in fuse thread and then this
writev placing its delayed post-op frame in fdctx.
This helps flush make sure all the delayed post-ops are
completed.

Change-Id: Ia78ca556f69cab3073c21172bb15f34ff8c3f4be
BUG: 888174
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4428
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-01-29 12:29:51 -08:00
Raghavendra Bhat
99e63168c4 cluster/afr: before checking lock_count of internal lock make sure its not
entrylk

when the expected lock count is equal to the attempted lock count, then before
deciding that lock is failed on all the nodes, make sure the lock type is
checked properly.

Change-Id: I1f362d54320cb6ec5654c5c69915c0f61c91d8c7
BUG: 765564
Signed-off-by: Raghavendra Bhat <raghavendra@redhat.com>
Reviewed-on: http://review.gluster.org/4436
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-01-28 00:20:46 -08:00
Anand Avati
1939519979 replicate: fix lock counting in blocking lock path
As of http://review.gluster.org/2828, the blocking lock code
path's condition for checking completion of locking atempt is
broken. The condition -

	if ((child_index == priv->child_count) || ...)

and

	if ((child_index == priv->child_count) && ...)

which is retained to check completion of blocking lock attempts
for DATA/METADATA transaction will _always_ fail because a few
lines above we have -

      child_index = cookie % priv->child_count;

So child_index will never equal priv->child_count. This leaves
the correctness at the mercy of the next part of the
conditional -

	.. (int_lock->lock_count == int_lock->lk_expected_count) ..

This "works" as long as no server went down during the transaction.

If a server goes down in the middle of the transaction, then this
condition also fails, and the code wraps around and starts a
blocking lock attempt loop all the way again from from the first server.
This results in double locks getting acquired on those servers, and
eventually the second condition gets hit (first condition is _never_
hit) and we come out of locking phase.

During unlock phase we perform only one unlock per server leaving the
other lock "leaked" forever.

Change-Id: I7189cdf3f70901b04647516fe1d1e189f36cc8dd
BUG: 765564
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4433
Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com>
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
2013-01-26 00:20:03 -08:00
shishir gowda
6fd654dc94 cluster/distribute: get_layout should account only available subvols
The earlier logic used to check if (layout-spread-count <= subvol_cnt -
decommissioned bricks). With this if a subvol was down, and layout-spread was >
upsubvols, a mkdir ended up creating holes in the layout.

The fix is to consider only the combination of subvols which are usable (not
down or not decommissioned).

Change-Id: I61ad3bcaf4589f5a75f7887cfa595c98311ae3bb
BUG: 902610
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/4412
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-01-23 23:43:39 -08:00
Krishnan Parthasarathi
1ab086d863 afr: Modified book-keeping structures for entrylks
* There are upto 3 entry lockees that may be needed to perform
  entrylk'ing in posix dir-write operations.

* For eg, rmdir ("/a/b") needs to acquire locks on two entities,
  - entrylk ("/a", "b")
  - entrylk ("/a/b", null)

* Changed existing entrylk/rename/selfheal (entrylk) transactions
  to use the new book-keeping structures

* Fixed few issues in afr_trace_entry_lk{in,out} functions. Tracing is now
  aware of the new entry lockee structure.

Implementation notes:
* Changed 'cookie' sent in stack_wind to encode lockee_entity_no
  and subvol_no.

  cookie is a non-negative integer such that 0 <= cookie < replica_count,
  When more than one lock is being acquired across the subvolumes,
  cookie % replica_count gives the subvol_no
  cookie / replica_count gives the lockee_entity_no.

Change-Id: Idbf41803387a7d59a0f7fcb1453d91cea74da153
BUG: 765564
Signed-off-by: Krishnan Parthasarathi <kp@gluster.com>
Reviewed-on: http://review.gluster.org/2828
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-01-23 09:17:00 -08:00
Pranith Kumar K
91ac9f9741 cluster/afr: Remove strict-readdir implementation
Leaving option frame-work un-changed for backward compatibility.

Change-Id: I40bce1ec360801307e67f09e53b0721f64efab37
BUG: 886998
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4309
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-01-23 01:26:09 -08:00
Pranith Kumar K
e908659dfe self-heald: Remove stale index even in heal info
Change-Id: Ic1c9559aec59c1fb9dfede4aba8895f3b86f32f1
BUG: 861015
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4098
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
2013-01-22 19:43:50 -08:00