1081 Commits

Author SHA1 Message Date
Ravishankar N
dfe698bae7 afr: check for split-brain before proceeding with fops
Bail out of fops if split brain has been detected during lookup

Change-Id: Id387dbb1a25eec4a121dedceadc6069bdea24b5d
BUG: 1010834
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
Reviewed-on: http://review.gluster.org/5988
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-10-16 19:30:14 -07:00
Kaushal M
1cf9256707 dht: dht_lookup_dir_cbk should set op_errno as local->op_errno
Two glusterfs clients return inconsistent errnos when the bricks of the volume
were down. Consider two gluster mounts. Mount 1 was done when the bricks were
online. Mount 2 was done after the bricks were killed, (using the 'glusterfs'
command instead of the mount script).

For any request, mount 1 will return ENOTCONN, where as mount 2 will return
ENOENT.

This happens because for the 2nd mount, a fuse would send a lookup on '/' for
any request, as it hadn't been done yet. The client xlator returns ENOTCONN,
but the dht_lookup_dir_cbk changed this to ENOENT unconditionally when
aggregating. So, fuse returned ENOENT, even though the errno should have been
ENOTCONN.

Change-Id: I4b7a6d84ce5153045a807fccc01485afe0377117
BUG: 1019095
Signed-off-by: Kaushal M <kaushal@redhat.com>
Reviewed-on: http://review.gluster.org/6072
Reviewed-by: Anand Avati <avati@redhat.com>
Tested-by: Anand Avati <avati@redhat.com>
2013-10-15 00:38:07 -07:00
Harshavardhana
6836118b21 libglusterfs: Add monotonic clocking counter for timer thread
gettimeofday() returns the current wall clock time and timezone.
Using these functions in order to measure the passage of time
(how long an operation took) therefore seems like a no-brainer.

This time suffer's from some limitations:

a. They have a low resolution: “High-performance” timing by
definition, requires clock resolutions into the microseconds
or better.

b. They can jump forwards and backwards in time: Computer
clocks all tick at slightly different rates, which causes
the time to drift. Most systems have NTP enabled which
periodically adjusts the system clock to keep them in sync
with “actual” time. The adjustment can cause the clock to
suddenly jump forward (artificially inflating your timing
numbers) or jump backwards (causing your timing calculations
to go negative or hugely positive). In such cases timer
thread could go into an infinite loop.

From 'man gettimeofday':
----------
..
..
The time returned by gettimeofday() is affected by discontinuous
jumps in the system time (e.g., if the system administrator manually
changes the system time).  If you need a monotonically increasing
clock, see clock_gettime(2).
..
..
----------

Rationale:

For calculating interval timing for Timer thread, all that’s
needed should be clock as a simple counter that increments
at a stable rate.

This is necessary to avoid the jumps which are caused by using
"wall time", this counter must be monotonic that can never
“tick” backwards, ever.

Change-Id: I701d31e71a85a73d21a6c5cd15583e7a5a645eeb
BUG: 1017993
Signed-off-by: Harshavardhana <harsha@harshavardhana.net>
Reviewed-on: http://review.gluster.org/6070
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-10-15 00:14:57 -07:00
Venkatesh Somyajulu
75caba6371 cluster/afr: [Feature] Command implementation to get heal-count
Currently to know the number of files to be healed, either user
has to go to backend and check the number of entries present in
indices/xattrop directory. But if a volume consists of large
number of bricks, going to each backend and counting the number
of entries is a time-taking task. Otherwise user can give
gluster volume heal vol-name info command but with this
approach if no. of entries are very hugh in the indices/
xattrop directory, it will comsume time.

So as a feature, new command is implemented.

Command 1: gluster volume heal vn statistics heal-count
This command will get the number of entries present in
every brick of a volume. The output displays only entries
count.

Command 2: gluster volume heal vn statistics heal-count
           replica 192.168.122.1:/home/user/brickname

           Here if we are concerned with just one replica.
So providing any one of the brick of a replica will get
the number of entries to be healed for that replica only.

Example:
Replicate volume with replica count 2.

Backend status:
--------------
[root@dhcp-0-17 xattrop]# ls -lia | wc -l
1918

NOTE: Out of 1918, 2 entries are <xattrop-gfid> dummy
entries so actual no. of entries to be healed are
1916.

[root@dhcp-0-17 xattrop]# pwd
/home/user/2ty/.glusterfs/indices/xattrop

Command output:
--------------
Gathering count of entries to be healed on volume volume3 has been successful

Brick 192.168.122.1:/home/user/22iu
Status: Brick is Not connected
Entries count is not available

Brick 192.168.122.1:/home/user/2ty
Number of entries: 1916

Change-Id: I72452f3de50502dc898076ec74d434d9e77fd290
BUG: 1015990
Signed-off-by: Venkatesh Somyajulu <vsomyaju@redhat.com>
Reviewed-on: http://review.gluster.org/6044
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-10-14 14:41:54 -07:00
Venkatesh Somyajulu
047882750e cluster/afr : Implementation of command "gluster volume heal vn statistics"
"gluster volume heal volumename statistics" command gives the summary
of the afr crawl done based on the entries present in the xattrop
directory. Whenever afr crawls are attempted, the beginning time of
crawl, end time of crawl, no of files healed, heal-failed count and
number of files in split brain are shown along with the type of the
crawl. If crawl is already in progress then it will give the number
of files healed, heal failed count and number of files in split-brain
from the beginning of the crawl and instead of telling the end time of
the crawl, "CRAWL IN PROGRESS" message will be shown.

Output format:
command: "gluster volume heal volume-name statistics"
Output:
Gathering afr crawl statistics crawl statistics on volume volume-name
has been successful
------------------------------------------------

Crawl statistics for brick no 0
Hostname of brick 192.168.122.248

Starting time of crawl: Wed Jul 10 15:52:38 2013

Ending time of crawl: Wed Jul 10 15:52:38 2013

Type of crawl: INDEX
No. of entries healed: 0
No. of entries in split-brain: 0
No. of heal failed entries: 0

Starting time of crawl: Wed Jul 10 15:52:38 2013

Ending time of crawl: Wed Jul 10 15:52:38 2013

Type of crawl: INDEX
No. of entries healed: 0
No. of entries in split-brain: 0
No. of heal failed entries: 0

------------------------------------------------

Crawl statistics for brick no 1
Hostname of brick 192.168.122.1

Starting time of crawl: Wed Jul 10 15:52:42 2013

Ending time of crawl: Wed Jul 10 15:52:42 2013

Type of crawl: INDEX
No. of entries healed: 0
No. of entries in split-brain: 0
No. of heal failed entries: 0

Starting time of crawl: Wed Jul 10 15:52:42 2013

Ending time of crawl: Wed Jul 10 15:52:42 2013

Type of crawl: INDEX
No. of entries healed: 0
No. of entries in split-brain: 0
No. of heal failed entries: 0

--------------------------------------------------

Change-Id: I10bf9d10b005741db9973fb1352e0dd59ed99aa9
BUG: 949400
Signed-off-by: Venkatesh Somyajulu <vsomyaju@redhat.com>
Reviewed-on: http://review.gluster.org/4790
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-10-14 14:41:44 -07:00
Pranith Kumar K
86f936f438 cluster/afr: Handle quota size xattr separately in lookup
Quota size xattrs are not maintained by afr. There is a
possibility that they differ even when both the directory
changelog xattrs suggest everything is fine. So if there is at
least one 'source' check among the sources which has the maximum
quota size. Otherwise check among all the available ones for
maximum quota size. This way if there is a source and stale copies
it always votes for the 'source'.

Change-Id: Ia222379cbafa7043dd03f533c105860f2c7b8b0d
BUG: 1016683
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/6052
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Varun Shastry <vshastry@redhat.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-10-10 10:38:53 -07:00
Pranith Kumar K
c32db94e29 cluster/afr: Change Self-heal domain separator to ':'
'-' can be present in a volume. This may lead to domain
collisions in future.

Tests:
Checked in gdb that domain comes with ':' separator:

Breakpoint 1, pl_common_inodelk (frame=0x7fdabcce51a4,
this=0x8bde20, volume=0x8b50d0 "r2-replicate-0:self-heal",
inode=0x7fdab822f0e8, cmd=6, flock=0x7fdabc76eee4,
loc=0x7fdabc76ede4, fd=0x0, xdata=0x7fdabc6e0ab0) at inodelk.c:597

Change-Id: I4456ae35ac8bf21e6361c34e9ad437f744a2e84b
BUG: 993981
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/6025
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-10-03 21:28:15 -07:00
Anuradha
2526eea5dd Logging : Improved the log message on unlock failure in afr_unlock_inodelk_cbk.
"unlock failed on 1 unlock" seems meaningless in the log message. Improved it.

Change-Id: If67d3f9d4aa5310d0b6728a6c89fa58a5cc93d12
BUG: 1012947
Signed-off-by: Anuradha <atalur@redhat.com>
Reviewed-on: http://review.gluster.org/6012
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
2013-09-30 12:54:17 -07:00
Santosh Kumar Pradhan
e9554f7792 gNFS: Incorrect NFS ACL encoding for XFS
Problem:
Incorrect NFS ACL encoding causes "system.posix_acl_default"
setxattr failure on bricks on XFS file system. XFS (potentially
others?) doesn't understand when the 0x10 prefix is added to the
ACL type field for default ACLs (which the Linux NFS client adds)
which causes setfacl()->setxattr() to fail silently. NFS client
adds NFS_ACL_DEFAULT(0x1000) for default ACL.

FIX:
Mask the prefix (added by NFS client) OFF, so the setfacl is not
rejected when it hits the FS.

Original patch by: "Richard Wareing"

Change-Id: I17ad27d84f030cdea8396eb667ee031f0d41b396
BUG: 1009210
Signed-off-by: Santosh Kumar Pradhan <spradhan@redhat.com>
Reviewed-on: http://review.gluster.org/5980
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-09-29 16:54:37 -07:00
Anand Avati
84fa8af38d core: block unused signals in created threads
Block all signal except those which are set for explicit handling
in glusterfs_signals_setup(). Since thread spawning code in
libglusterfs and xlators can get called from application threads
when used through libgfapi, it is necessary to do this blocking.

Change-Id: Ia320f80521a83d2edcda50b9ad414583a0175281
BUG: 1011662
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/5995
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
2013-09-25 01:33:16 -07:00
shishir gowda
83937d1666 Revert "cluster/dht: Return success in dht_discover if layout issues"
This reverts commit a3e593f9f17cb1e68db97bb5a0d8074793a33964 which
was bought into fix dht_layout_anomalies error handling.

We still see applications error'ing out due to graph switches.

To fix the above issue -

We cannot heal in dht_discover, as it is a gfid based lookup, and not
path based. So, returning error here would lead to app's to see failure.

Also, update the layout in inode_ctx even if it has anomalies. Let
subsequent heals fix the issue.

Conflicts:
	xlators/cluster/dht/src/dht-common.c

Signed-off-by: shishir gowda <sgowda@redhat.com>
Change-Id: I68c1056c3587e04a02344548546ddd06034489c5
BUG: 960348
Reviewed-on: http://review.gluster.org/5443
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
2013-09-24 23:48:04 -07:00
shishir gowda
711484d759 cluster/dht: Fix anomaly check
We were wrongly detecting holes/overlaps for already accounted
errors. Additionally, sort should also handle zero'ed out layout

Change-Id: Ic3d13e1d735b914f9acc01fe919bc90656baea48
BUG: 1003851
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/5762
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-09-23 11:40:33 -07:00
Harshavardhana
66747c96e6 distribute: Rebalance should provide even disk space distribution
Earlier disk space check had an issue which didn't
provide the needed functionality to avoid migration
when the destination had lesser available space,
scenario we need to avoid is stated below :

During rebalance `migrate-data` - Destination subvol experiences
a `reduction` in 'blocks' of free space, at the same time source
subvol gains certain 'blocks' of free space. A valid check is
necessary here to avoid errorneous move to destination where
the space could be scantily available.

This patch provides a proper fix in place by subtracting
necessary file blocks from destination and adding those blocks
to source.

Change-Id: I9c7840716a4256ef614ffc0fbfd9f2b456ac28c8
BUG: 982919
Signed-off-by: Harshavardhana <harsha@harshavardhana.net>
Reviewed-on: http://review.gluster.org/5961
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Shishir Gowda <sgowda@redhat.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-09-19 09:25:28 -07:00
Ravishankar N
c550ae6952 cli/glusterd: improve rebalance fix-layout status reporting
Problem:
Currenly the CLI rebalance status command output does not indicate the
'type' of rebalance, i.e. whether a full rebalance or only a fix-layout
was carried out.

Fix: After the rebalance status of all peers is received by the
originator glusterd, alter it to reflect the type of rebalance
before passing it on to the CLI process.

Change-Id: I1940ffda0d36e25e5b33c84a0ea210394cc9e1d3
BUG: 1004744
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
Reviewed-on: http://review.gluster.org/5826
Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-09-19 09:22:36 -07:00
Pranith Kumar K
f86a37bddf cluster/afr: Have common inode-write-fop cbk
Change-Id: Ia7b324b86d6a7051d187106d7a060155e77defc5
BUG: 910217
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/5238
Reviewed-by: Ravishankar N <ravishankar@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-09-18 14:01:52 -07:00
Anand Avati
de2a8d3033 Revert "cluster/distribute: Rebalance should also verify free inodes"
This reverts commit 215fea41a96479312a5ab8783c13b30ab9fe00fa

Realized soon after merging, that this patch is actually useless.
Checking for inode utilization on dst relative to src is of no use
because dst is already storing linkfile and inode is already consumed
no matter what. We might as well perform the rebalance and save on an
inode on the src node.

Change-Id: I7b8b8d35d9063a710e1bee1c7c6c7739fcc45ad4
Reviewed-on: http://review.gluster.org/5960
Reviewed-by: Harshavardhana <harsha@harshavardhana.net>
Tested-by: Anand Avati <avati@redhat.com>
2013-09-17 17:45:25 -07:00
Harshavardhana
215fea41a9 cluster/distribute: Rebalance should also verify free inodes
Currently during `MIGRATE_DATA` we never verified about the total
inode usage among new and old bricks. Such checks are available for
disk space usage but it is also needed for inodes during data
migration. Such a check leads to uniform outcome of file distribution
upon rebalance.

Patch provides:

- Check dst_inodes < src_inodes, a friendly `warning` message to
  indicate we have to ignore such an attempt
- Rename __dht_check_free_space() --> __dht_check_free_space_and_inodes()

Change-Id: I7bc4dd8b507883f0fb836300e99f0bb083493f5f
BUG: 982919
Signed-off-by: Harshavardhana <harsha@harshavardhana.net>
Reviewed-on: http://review.gluster.org/5948
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-09-17 16:37:24 -07:00
Anand Avati
ebcf1c8ddb cluster/dht: assign layout onto missing directories too
The current self-healing algorithm is ignoring missing directories
for assigning new layout. When lookup() is racing against mkdir()
or when self-healing a half-done mkdir(), the layout assignment split
must happen based on the final number of directories, and not the
currently existing number of directories (because we finish mkdir()
of missing directories before hash layout assignment).

Without this fix, concurrent mkdir() and lookup() will step on
each others feet, create a messed up layout on disk, and end up
with different in-memory layouts.

Once two clients have different in-memory layouts, creation of
subdirectory will not arbitrate on the same hashed subvolume and will
result in GFID mismatch of the sub-directory.

Change-Id: Ia47acad67c265060405984c822b4d37512b9dbb3
BUG: 907072
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/5849
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Peter Portante <pportant@redhat.com>
Tested-by: Peter Portante <pportant@redhat.com>
2013-09-09 14:58:09 -07:00
Pranith Kumar K
2c15621d26 cluster/afr: Set size based source only when sizes are unequal
Change-Id: I18583f14edf1011401be15744371e2a6b79d75cc
BUG: 1003842
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/5763
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-09-03 21:37:44 -07:00
Anand Avati
6f85f6ce64 afr: make NOP truncate/ftruncate efficient
If truncate/ftruncate is called with the offset as the current size
of file, then skip the durability fsync and unwind quickly.

Change-Id: I0baec68d96c6d4d8217d33bd9738f7ed0d1b40c5
BUG: 958118
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/5737
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
2013-09-03 16:09:09 -07:00
Venkatesh Somyajulu
85b6dcfb53 cluster/afr: Improvement in logging of self heal completion status
Additional information for source and sinks are added.

Change-Id: I1704956ff86ac3ae36744efe7499c1d1c43faeaf
BUG: 968301
Signed-off-by: Venkatesh Somyajulu <vsomyaju@redhat.com>
Reviewed-on: http://review.gluster.org/5638
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-08-29 15:50:28 -07:00
Pranith Kumar K
7dd4be82b1 cluster/afr: Reset attempted count before attempting blocking lock
Problem:
internal_lock->lk_attempted_count keeps track of the number of blocking
locks attempted. lk_expected_count keeps track of the number locks expected.
Here are the sequence of steps that happen which lead to the illution that
a full file lock is achieved, even without attempting any lock.

2 mounts are doing dd on same file. Both of them witness a brick going
down and coming back up again. Both of the mounts issue self-heal
1) Both mount-1, mount-2 attempt full file locks in self-heal domain.
lets say mount-1 got the lock, mount-2 attempts blocking lock.

2) mount-1 attempts full file lock in data domain. It goes into blocking
mode because some other writes are in progress. Eventually it gets the lock.
But this results in lk_attempted_count to be still as 2 and will not be reset.
It completes syncing the data.

3) mount-1 before unlocking final small range lock attempts full file lock in
data domain to figure out the source/sink. This will be put into blocked mode
again because some other writes are in progress. But this time seeing the
stale value of lk_attempted_count being equal to lk_expected_count, blocking_lock
phase thinks it completed locking without acquiring a single lock :-O.

4) mount-1 reads xattrs without any lock but since it does not modify the xattrs,
no harm is done by this phase. It tries to do unlocks and the unlocks will fail
because the locks are never taken in data domain. mount-1 also unlocks
self-heal domain locks.

Our beloved mount-2 now gets the chance to cause horror :-(.

5) mount-2 gets the full range blocking lock in self-heal domain.
Please note that this sets lk_attempted_count to 2.

6) mount-2 attempts full range lock in data domain, since there are still
writes on going, it switches to blocking mode. But since lk_attempted_count is 2
which is same as lk_expected_count, blocking phase locks thinks it actually got
the full range locks even though not a single lock request went out the wire.

7) mount-2 reads the change-log xattrs, which would give the number of operations
in progress (lets call this 'X'). It does the syncing and at the end of the sync
decrements the changelog by 'X'. But since that 'X' was introduced by 'X' number
of transactions that are in progress, they also decrement the changelog by 'X'.
Effectively for 'X' operations 'X' number of pre-ops are done but 2 times 'X'
number of post-ops are done resulting in -ve changelog numbers.

Fix:
Reset the lk_attempted_count and inode locks array that is used to remember locks
that are granted.

Change-Id: Ic0a79cd16f32392ea7c790511343c73592bbe6bd
BUG: 1002698
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/5736
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-08-29 12:29:13 -07:00
Anand Avati
c8ccfda3a7 cluster/afr: unlock before aborting transaction
Else this results in a missing frame causing a hang

Change-Id: Ib5f3dc6a3999449faa2853cee2944af2fb065a20
BUG: 1002399
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/5731
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
2013-08-29 03:35:44 -07:00
Brian Foster
bf63e7c6a3 cluster/stripe: enable coalesce mode by default
It has been available for a while now and is probably the sane
default due to the more efficient layout and performance benefit.

BUG: 1001207
Change-Id: I6275f9741866c0afd6e685f8dc5867a86485fd20
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-on: http://review.gluster.org/5624
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-08-28 21:33:06 -07:00
Pranith Kumar K
db0b19a542 cluster/afr: Add special handling for failure postops
Idea is to not leave the file in FOOL-FOOL scenario in case on
all the bricks data transaction failed with EDQUOT to avoid
increasing un-necessary load of self-heals in the system.

For directory transactions don't leave pending changelog in case
the failures are seen on all the subvolumes.

Change-Id: I38a5561d1d581a78347a76a4a509514e4a0c3fb7
BUG: 969461
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/5709
Reviewed-by: Anand Avati <avati@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
2013-08-28 17:42:59 -07:00
Kaleb S. KEITHLEY
b880b6b290 stripe: remove unused param, handle mem alloc failure
Change-Id: I9c27b1edab111031ca8eea9cc49480ea01e39089
BUG: 1002207
Signed-off-by: Kaleb S. KEITHLEY <kkeithle@redhat.com>
Reviewed-on: http://review.gluster.org/5716
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-08-28 16:58:43 -07:00
Anand Avati
bbcdbd8c36 synctask: minor enhancements
- Enhance syncenv_new() to accept scaling parameters of syncproc.
  Previously the scaling parameters were hardcoded and decided at
  compile time.

- New API synctask_create() which returns the created synctask. This
  is similar to synctask_new which only returned the status of whether
  a synctask could be created or not.

  The meaning of NULL cbk in synctask_create() means the task is
  "joinable". Until synctask_join() is called on such a synctask,
  the task is not reaped and resources are not destroyed. The
  task would be in a zombie state after synctask_fn returns and
  before synctask_join() is called.

Change-Id: I368ec9037de9510d2ba951f0aad86aaf18d9a6b6
BUG: 986775
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/5365
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2013-08-28 15:52:24 -07:00
Pranith Kumar K
faef08b7cf cluster/afr: Don't delay post op in cases of failures
Change-Id: Ib0c3af6babc61dc3ed45252582876e2f243d6446
BUG: 958118
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/5635
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-08-28 15:40:31 -07:00
Amar Tumballi
cb533fc296 core: remove GLUSTERFS_CREATE_MODE_KEY usage
Change-Id: I23b8cb7223b91a55af1cd4214f61bbe0e87351f6
BUG: 952029
Signed-off-by: Amar Tumballi <amarts@redhat.com>
Reviewed-on: http://review.gluster.org/5683
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-08-23 12:04:58 -07:00
Amar Tumballi
df9bfff8ea core: changes to support gfid-access
Change-Id: I38d2fdc47e4b805deafca6805e54807976ffdb7e
Signed-off-by: Amar Tumballi <amarts@redhat.com>
BUG: 952029
Reviewed-on: http://review.gluster.org/5496
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-08-21 12:13:16 -07:00
Amar Tumballi
271804a26c Revert "fuse: auxiliary gfid mount support"
This reverts commit 4c0f4c8a89039b1fa1c9c015fb6f273268164c20.

Conflicts:
	xlators/mount/fuse/src/fuse-bridge.c

For build issues added CREATE_MODE_KEY definition in:
        libglusterfs/src/glusterfs.h

Change-Id: I8093c2a0b5349b01e1ee6206025edbdbee43055e
BUG: 952029
Signed-off-by: Amar Tumballi <amarts@redhat.com>
Reviewed-on: http://review.gluster.org/5495
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-08-21 12:12:07 -07:00
shishir gowda
2a19578774 cluster/dht: Del GF_READDIR_SKIP_DIRS key from dict for first_up
Currently, we sent GF_READDIR_SKIP_DIRS for all subvolumes if
first_subvol != first_up_subvolume.

Also first_up_subvolume can change with-in the life of a call and
cbk. Saving the first_up_subvol in dht_local for checks.

Change-Id: I6e369e63f29c9761993f2a66ed768c424bb44d27
BUG: 996474
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/5577
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
2013-08-18 05:01:27 -07:00
Anand Avati
1d1daa234e cluster/afr: Add largest file is source policy
For Write Once Read Many times type of work-load choosing largest
file to be the source will always resolve fool-fool
scenarios correctly. In other cases we fsync() the files and
will have a reliable 'wise man'.

Change-Id: Ic4dbea8d06db6d578fbcb866fb65ee2d066ac7ba
BUG: 958118
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/5519
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
2013-08-14 09:30:43 -07:00
Anand Avati
8360037701 afr: treat appending writes as stable writes.
Durability of appending writes is implicit in the file size. Therefore
performing an explicit fsync() is unnecessary in such cases as self-heal
can check for the size of file when pending changelog is not unambiguous.

Change-Id: I05446180a91d20e0dbee5de5a7085b87d57f178a
BUG: 927146
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/5501
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
2013-08-13 23:45:03 -07:00
Anand Avati
a1fe3d040a cluster/afr: skip directory inspection when entry self-heal is off
When user has explicitly configured to disable entry self-heal in the
client, it is wrong to do the healing in opendir. So skip it. This
is especially useful to reduce opendir() times after graph switches.

Change-Id: Ic6eb9ff2334a5b8417f2f35410a366a536bad5df
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/5528
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
2013-08-11 11:13:02 -07:00
Pranith Kumar K
ff38e3e028 cluster/afr: Unwind frame on error in readdir[p]
Change-Id: I5701bf115e0aa1adb4fb52f5418534910a2268d4
BUG: 994959
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/5531
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
2013-08-08 21:05:06 -07:00
Ravishankar N
0f77e30c90 afr: check for non-zero call_count before doing a stack wind
When one of the bricks of a 1x2 replicate volume is down,
writes to the volume is causing a race between afr_flush_wrapper() and
afr_flush_cbk(). The latter frees up the call_frame's local variables
in the unwind, while the former accesses them in the for loop and
sending a stack wind the second time. This causes the FUSE mount process
(glusterfs) toa receive a SIGSEGV when the corresponding unwind is hit.

This patch adds the call_count check which was removed when
afr_flush_wrapper() was introduced in commit 29619b4e

Change-Id: I87d12ef39ea61cc4c8244c7f895b7492b90a7042
BUG: 988182
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
Reviewed-on: http://review.gluster.org/5393
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-08-07 03:35:10 -07:00
Pranith Kumar K
36b102645a cluster/afr: Disable eager-lock if open-fd-count > 1
Lets say mount1 has eager-lock(full-lock) and after the eager-lock
is taken mount2 opened the same file, it won't be able to
perform any data operations until mount1 releases eager-lock.
To avoid such scenario do not enable eager-lock for transaction
if open-fd-count is > 1. Delaying of changelog piggybacking is
avoided in this situation.

Change-Id: I51b45d6a7c216a78860aff0265a0b8dabc6423a5
BUG: 910217
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/5432
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: venkatesh somyajulu <vsomyaju@redhat.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
2013-08-02 02:25:55 -07:00
Anand Avati
394055e31f dht: make linkfile creation mode explicitly get set
Because of posix default_acl on parent directory, the mode
of linkfile can get masked with the mode in the default acl.

This breaks DHT integrity. So let the mode get explicitly reset
after mknod().

Change-Id: Ia7328e1ee7b4430bda308f9da293dba78405e081
BUG: 990410
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/5440
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Raghavendra Talur <rtalur@redhat.com>
2013-07-31 13:31:20 -07:00
shishir gowda
ac2cd684c2 cluster/dht: Fix non-regular file ownership during rebalance
Currently non-regular files were created as root:root, and
their ownership never healed during rebalance process.

Also, in dht_linkfile_attr_heal, we have to heal the linkfiles
as default, as currently linkfiles are created as root:root.

That check existed, as earlier linkfiles were created as
frame->root->uid/gid

Change-Id: I6cd88361b81bdd500e15bc47b623f5db8eec88e9
BUG: 990154
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/5434
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-07-31 03:42:42 -07:00
Pranith Kumar K
177f32e5b0 cluster/afr: Print self-heal log when self-heal succeeds
Change-Id: I95e47e589419dc6a032cbd8ba01964b6c176c2d5
BUG: 927146
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/5408
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
2013-07-31 01:41:45 -07:00
shishir gowda
e306698b00 cluster/dht: Treat migration failures due to space constraints as skipped
Currently rebalance/remove-brick op's display migration failed count even
for files which failed due to space issues (not enough space for file, or
migration leading to cluster imbalance)

These will now be counted as skipped, and rebalance/remove-brick status
will display the additional counter

Change-Id: I674904d380b5f8300e9ca9e6af557c3d30d6cff4
BUG: 989846
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/5399
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
2013-07-30 23:55:59 -07:00
Vijay Bellur
c2064eff89 cluster/dht: Allow non-local clients to function with nufa volumes.
nufa fails to init if a local brick is not found as of today.
With this patch, if a local brick is not found, nufa switches
over to dht mode of operations.

Change-Id: I50ac1af37621b1e776c8c00a772b8e3dfb3691df
BUG: 980838
Signed-off-by: Vijay Bellur <vbellur@redhat.com>
Reviewed-on: http://review.gluster.org/5414
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
2013-07-29 06:27:47 -07:00
Pranith Kumar K
4944fc943e cluster/afr: Handle REPLICATE_TRASH_DIR from old bricks
Change-Id: Ib99f79d3fa607c818dbc62006516480f598d8add
BUG: 886998
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/4640
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
2013-07-26 01:13:58 -07:00
shishir gowda
ab87748907 cluster/dht: mark linkfile creation/deletion as internal fop
Currently dht creates/deletes linkfiles for various ops like
rename/linking and when layout changes. dht_linkfile_create
already sends a key GLUSTERFS_INTERNAL_FOP_KEY in dict to
identify this as an internal fop and not user based op.

Enhancing rename related links/unlinks to send this key too.
Marker/changelog or other xlators can now identify these as
internal fops and handle them accordingly

Change-Id: Ib1ca789e6dbce48703c55ad3f4f88f7cd2df3d06
BUG: 987428
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/5335
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
2013-07-24 04:27:29 -07:00
Anand Avati
37ac6bdca8 storage/posix: implement batched fsync in a single thread
Because of the extra fsync()s issued by AFR transaction, they
could potentially "clog" all the io-threads denying unrelated
operations from making progress.

This patch assigns a dedicated thread to issues fsyncs, as
an experimental feature to understand performance characteristics
with the approach.

As a basis, incoming individual fsync requests are grouped into
batches, falling in the same @batch-fsync-delay-usec window of
time. These windows can extend in practice, as processing of
the previous batch can take longer than @batch-fsync-delay-usec
while new requests are getting batched.

The feature support three modes (similar to the -S modes of fs_mark)

- syncfs: In this mode one syncfs() is issued per batch, instead
  of N fsync()s (one per file.)

- syncfs-single-fsync: In this mode one syncfs() is issued per
  batch (which, on Linux, guarantees the completion of write-out
  of dirty pages in the filesystem up to that point) and one single
  fsync() to synchronize or flush the controller/drive cache. This
  corresponds to -S 2 of fsmark.

- syncfs-reverse-fsync: In this mode, one syncfs() is issued per
  batch, and all the open files in that batch are fsync()'ed in
  the reverse order of the queue. This corresponds to -S 4 of
  fsmark.

- reverse-fsync: In this mode, no syncfs() is issued and all the
  files in the batch are fsync()'ed in the reverse order. This
  corresponds to -S 3 of fsmark.

Change-Id: Ia1e170a810c780c8d80e02cf910accc4170c4cd4
BUG: 927146
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4746
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
2013-07-23 06:11:12 -07:00
Pranith Kumar K
eef0737ca6 cluster/afr: Handle parallel hardlinks self-heal
Change-Id: Ieda11870c65edae500140b6c061f15a7b3f264f3
BUG: 986905
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Reviewed-on: http://review.gluster.org/5370
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
2013-07-23 00:33:39 -07:00
Raghavendra G
4c0f4c8a89 fuse: auxiliary gfid mount support
* files can be accessed directly through their gfid and not just
  through their paths. For eg., if the gfid of a file is
  f3142503-c75e-45b1-b92a-463cf4c01f99, that file can be accessed
  using <gluster-mount>/.gfid/f3142503-c75e-45b1-b92a-463cf4c01f99

  .gfid is a virtual directory used to seperate out the namespace
  for accessing files through gfid. This way, we do not conflict with
  filenames which can be qualified as uuids.

* A new file/directory/symlink can be created with a pre-specified
  gfid. A setxattr done on parent directory with fuse_auxgfid_newfile_args_t
  initialized with appropriate fields as value to key "glusterfs.gfid.newfile"
  results in the entry <parent>/bname whose gfid is set to args.gfid. The
  contents of the structure should be in network byte order.

  struct auxfuse_symlink_in {
        char     linkpath[]; /* linkpath is a null terminated string */
  } __attribute__ ((__packed__));

  struct auxfuse_mknod_in {
        unsigned int   mode;
        unsigned int   rdev;
        unsigned int   umask;
  } __attribute__ ((__packed__));

  struct auxfuse_mkdir_in {
        unsigned int   mode;
        unsigned int   umask;
  } __attribute__ ((__packed__));

  typedef struct {
        unsigned int  uid;
        unsigned int  gid;
        char          gfid[UUID_CANONICAL_FORM_LEN + 1]; /* a null terminated gfid string
                                                      * in canonical form.
                                                      */
        unsigned int  st_mode;
        char          bname[];     /* bname is a null terminated string */

        union {
                struct auxfuse_mkdir_in   mkdir;
                struct auxfuse_mknod_in   mknod;
                struct auxfuse_symlink_in symlink;
        } __attribute__ ((__packed__)) args;
  } __attribute__ ((__packed__)) fuse_auxgfid_newfile_args_t;

  An initial consumer of this feature would be geo-replication to
  create files on slave mount with same gfids as that on master.
  It will also help gsyncd to access files directly through their
  gfids. gsyncd in its newer version will be consuming a changelog
  (of master) containing operations on gfids and sync corresponding
  files to slave.

* Also, bring in support to heal gfids with a specific value.
  fuse-bridge sends across a gfid during a lookup, which storage
  translators assign to an inode (file/directory etc) if there is
  no gfid associated it. This patch brings in support
  to specify that gfid value from an application, instead of relying
  on random gfid generated by fuse-bridge.

  gfids can be healed through setxattr interface. setxattr should be
  done on parent directory. The key used is "glusterfs.gfid.heal"
  and the value should be the following structure whose contents
  should be in network byte order.

  typedef struct {
        char      gfid[UUID_CANONICAL_FORM_LEN + 1]; /* a null terminated gfid
                                                      * string in canonical form
                                                      */
        char      bname[]; /* a null terminated basename */
  } __attribute__((__packed__)) fuse_auxgfid_heal_args_t;

  This feature can be used for upgrading older geo-rep setups where gfids
  of files are different on master and slave to newer setups where they
  should be same. One can delete gfids on slave using setxattr -x and
  .glusterfs and issue stat on all the files with gfids from master.

Thanks to "Amar Tumballi" <amarts@redhat.com> and "Csaba Henk"
<csaba@redhat.com> for their inputs.

Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
Change-Id: Ie8ddc0fb3732732315c7ec49eab850c16d905e4e
BUG: 952029
Reviewed-on: http://review.gluster.com/#/c/4702
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Tested-by: Amar Tumballi <amarts@redhat.com>
Reviewed-on: http://review.gluster.org/4702
Reviewed-by: Xavier Hernandez <xhernandez@datalab.es>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Vijay Bellur <vbellur@redhat.com>
2013-07-19 01:14:08 -07:00
shishir gowda
ad5ab12160 dht: fix dht_discover_cbk doing a wrong layout set.
with the sequence of operations are like below, we have issues
with current code (MP == mountpoint):
T0,MP1# mkdir /abcd (Succeeds on hash_subvol)
T1,MP2# mkdir /abcd (Gets EEXIST as dir exists in hash_subvol)
T2,MP2# mkdir /.gfid/<abcd's gfid>/xyz (lookup happens on abcd's
        gfid, calls dht_discover)
T3,MP1# (Completes mkdir(), goes to dir_selfheal to set the layouts).
T4,MP2# (dht_discover_cbk gets success for lookup as the entry
        existed, as layout is not yet written, it says normalize
        done, found holes).
T5,MP2# (as layout anomaly is not considered an issue in this patch,
        dht_layout_set happens on inode, with all xlators pointing
        to 0s)
T6,MP1# (completes mkdir call, inode has proper layouts)
T7,MP2# mkdir /.gfid/<abcd's gfid>/xyz fails with ENOENT
        (with log saying no subvol found for hash value of xyz.

Porting Amar's fix from down-stream beta branch.

Change-Id: Ibdc37ee614c96158a1330af19cad81a39bee2651
BUG: 982913
Original-author: Amar Tumballi <amarts@redhat.com>
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/5302
Reviewed-by: Amar Tumballi <amarts@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-07-17 22:49:38 -07:00
shishir gowda
3e1d8e1689 cluster/dht: Prevent dht_access from going into a loop.
If access fails with ENOTCONN, do not wind to same subvol.
We wind to first-up-subvol if access fails with ENOTCONN.
In few cases, if dht has only 1 subvolume, and access fails with
ENOTCONN, we go into a infinite loop of winding to same subvol

The fix is to check if we previously wound to same subvol, and
fail if first-up-subvol is same.

Change-Id: Ib5d3ce7d33e8ea09147905a7df1ed280874fa549
BUG: 983431
Signed-off-by: shishir gowda <sgowda@redhat.com>
Reviewed-on: http://review.gluster.org/5319
Tested-by: Gluster Build System <jenkins@build.gluster.com>
Reviewed-by: Anand Avati <avati@redhat.com>
2013-07-15 23:56:34 -07:00