linux

iv/linux

Author	SHA1	Message	Date
Eric W. Biederman	05cb11c17e	ceph: Translate between uid and gids in cap messages and kuids and kgids - Make the uid and gid arguments of send_cap_msg() used to compose ceph_mds_caps messages of type kuid_t and kgid_t. - Pass inode->i_uid and inode->i_gid in __send_cap to send_cap_msg() through variables of type kuid_t and kgid_t. - Modify struct ceph_cap_snap to store uids and gids in types kuid_t and kgid_t. This allows capturing inode->i_uid and inode->i_gid in ceph_queue_cap_snap() without loss and pssing them to __ceph_flush_snaps() where they are removed from struct ceph_cap_snap and passed to send_cap_msg(). - In handle_cap_grant translate uid and gids in the initial user namespace stored in struct ceph_mds_cap into kuids and kgids before setting inode->i_uid and inode->i_gid. Cc: Sage Weil <sage@inktank.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-02-12 03:19:24 -08:00
Sage Weil	5ef50c3bec	ceph: simplify+fix atomic_open The initial ->atomic_open op was carried over from the old intent code, which was incomplete and didn't really work. Replace it with a fresh method. In particular: * always attempt to do an atomic open+lookup, both for the create case and for lookups of existing files. * fix symlink handling by returning 1 to the VFS so that we can follow the link to its destination. This fixes a longstanding ceph bug (#2392). Signed-off-by: Sage Weil <sage@inktank.com>	2012-08-02 09:11:19 -07:00
Linus Torvalds	cc8362b1f6	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph changes from Sage Weil: "Lots of stuff this time around: - lots of cleanup and refactoring in the libceph messenger code, and many hard to hit races and bugs closed as a result. - lots of cleanup and refactoring in the rbd code from Alex Elder, mostly in preparation for the layering functionality that will be coming in 3.7. - some misc rbd cleanups from Josh Durgin that are finally going upstream - support for CRUSH tunables (used by newer clusters to improve the data placement) - some cleanup in our use of d_parent that Al brought up a while back - a random collection of fixes across the tree There is another patch coming that fixes up our ->atomic_open() behavior, but I'm going to hammer on it a bit more before sending it." Fix up conflicts due to commits that were already committed earlier in drivers/block/rbd.c, net/ceph/{messenger.c, osd_client.c} * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (132 commits) rbd: create rbd_refresh_helper() rbd: return obj version in __rbd_refresh_header() rbd: fixes in rbd_header_from_disk() rbd: always pass ops array to rbd_req_sync_op() rbd: pass null version pointer in add_snap() rbd: make rbd_create_rw_ops() return a pointer rbd: have __rbd_add_snap_dev() return a pointer libceph: recheck con state after allocating incoming message libceph: change ceph_con_in_msg_alloc convention to be less weird libceph: avoid dropping con mutex before fault libceph: verify state after retaking con lock after dispatch libceph: revoke mon_client messages on session restart libceph: fix handling of immediate socket connect failure ceph: update MAINTAINERS file libceph: be less chatty about stray replies libceph: clear all flags on con_close libceph: clean up con flags libceph: replace connection state bits with states libceph: drop unnecessary CLOSED check in socket state change callback libceph: close socket directly from ceph_con_close() ...	2012-07-31 14:35:28 -07:00
Alex Elder	aa711ee340	ceph: define snap counts as u32 everywhere There are two structures in which a count of snapshots are maintained: struct ceph_snap_context { ... u32 num_snaps; ... } and struct ceph_snap_realm { ... u32 num_prior_parent_snaps; /* had prior to parent_since */ ... u32 num_snaps; ... } These fields never take on negative values (e.g., to hold special meaning), and so are really inherently unsigned. Furthermore they take their value from over-the-wire or on-disk formatted 32-bit values. So change their definition to have type u32, and change some spots elsewhere in the code to account for this change. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2012-07-30 18:15:47 -07:00
Al Viro	30d9049474	kill struct opendata Just pass struct file . Methods are happier that way... There's no need to return struct file from finish_open() now, so let it return int. Next: saner prototypes for parts in namei.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-07-14 16:33:39 +04:00
Al Viro	d95852777b	make ->atomic_open() return int Change of calling conventions: old new NULL 1 file 0 ERR_PTR(-ve) -ve Caller knows that struct file *; no need to return it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-07-14 16:33:35 +04:00
Al Viro	47237687d7	->atomic_open() prototype change - pass int * instead of bool * ... and let finish_open() report having opened the file via that sucker. Next step: don't modify od->filp at all. [AV: FILE_CREATE was already used by cifs; Miklos' fix folded] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-07-14 16:33:31 +04:00
Miklos Szeredi	2d83bde9a1	ceph: implement i_op->atomic_open() Add an ->atomic_open implementation which replaces the atomic lookup+open+create operation implemented via ->lookup and ->create operations. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: Sage Weil <sage@newdream.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-07-14 16:33:16 +04:00
Miklos Szeredi	3819219b59	ceph: remove unused arg from ceph_lookup_open() What was the purpose of this? Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: Sage Weil <sage@newdream.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-07-14 16:33:16 +04:00
Alex Elder	3ce6cd1233	ceph: avoid repeatedly computing the size of constant vxattr names All names defined in the directory and file virtual extended attribute tables are constant, and the size of each is known at compile time. So there's no need to compute their length every time any file's attribute is listed. Record the length of each string and use it when needed to determine the space need to represent them. In addition, compute the aggregate size of strings in each table just once at initialization time. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:46 -05:00
Amon Ott	a661fc5611	ceph: use 2 instead of 1 as fallback for 32-bit inode number The root directory of the Ceph mount has inode number 1, so falling back to 1 always creates a collision. 2 is unused on my test systems and seems less likely to collide. Signed-off-by: Amon Ott <ao@m-privacy.de> Signed-off-by: Sage Weil <sage@newdream.net>	2012-03-22 10:47:45 -05:00
Linus Torvalds	1a52bb0b68	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: ensure prealloc_blob is in place when removing xattr rbd: initialize snap_rwsem in rbd_add() ceph: enable/disable dentry complete flags via mount option vfs: export symbol d_find_any_alias() ceph: always initialize the dentry in open_root_dentry() libceph: remove useless return value for osd_client __send_request() ceph: avoid iput() while holding spinlock in ceph_dir_fsync ceph: avoid useless dget/dput in encode_fh ceph: dereference pointer after checking for NULL crush: fix force for non-root TAKE ceph: remove unnecessary d_fsdata conditional checks ceph: Use kmemdup rather than duplicating its implementation Fix up conflicts in fs/ceph/super.c (d_alloc_root() failure handling vs always initialize the dentry in open_root_dentry)	2012-01-13 10:29:21 -08:00
Sage Weil	a40dc6cc2e	ceph: enable/disable dentry complete flags via mount option Enable/disable use of the dentry dir 'complete' flag via a mount option. This lets the admin control whether ceph uses the dcache to satisfy negative lookups or readdir when it has the entire directory contents in its cache. This is purely a performance optimization; correctness is guaranteed whether it is enabled or not. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sage Weil <sage@newdream.net>	2012-01-12 11:00:40 -08:00
Al Viro	5706b27dea	ceph: propagate umode_t Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:55:16 -05:00
Sage Weil	be655596b3	ceph: use i_ceph_lock instead of i_lock We have been using i_lock to protect all kinds of data structures in the ceph_inode_info struct, including lists of inodes that we need to iterate over while avoiding races with inode destruction. That requires grabbing a reference to the inode with the list lock protected, but igrab() now takes i_lock to check the inode flags. Changing the list lock ordering would be a painful process. However, using a ceph-specific i_ceph_lock in the ceph inode instead of i_lock is a simple mechanical change and avoids the ordering constraints imposed by igrab(). Reported-by: Amon Ott <a.ott@m-privacy.de> Signed-off-by: Sage Weil <sage@newdream.net>	2011-12-07 10:46:44 -08:00
Sage Weil	c6ffe10015	ceph: use new D_COMPLETE dentry flag We used to use a flag on the directory inode to track whether the dcache contents for a directory were a complete cached copy. Switch to a dentry flag CEPH_D_COMPLETE that is safely updated by ->d_prune(). Signed-off-by: Sage Weil <sage@newdream.net>	2011-11-05 21:10:10 -07:00
Sage Weil	b58dc4100b	ceph: clear parent D_COMPLETE flag when on dentry prune When the VFS prunes a dentry from the cache, clear the D_COMPLETE flag on the parent dentry. Do this for the live and snapshotted namespaces. Do not bother for the .snap dir contents, since we do not cache that. Signed-off-by: Sage Weil <sage@newdream.net>	2011-11-03 09:23:49 -07:00
Amon Ott	3310f7541f	ceph: fix 32-bit ino numbers Fix 32-bit ino generation to not always be 1. Signed-off-by: Amon Ott <a.ott@m-privacy.de>	2011-10-25 16:10:17 -07:00
Sage Weil	83817e35cb	ceph: rename rsize -> rasize It controls readahead. Signed-off-by: Sage Weil <sage@newdream.net>	2011-10-25 16:10:15 -07:00
Linus Torvalds	ba5b56cb3e	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits) ceph: document unlocked d_parent accesses ceph: explicitly reference rename old_dentry parent dir in request ceph: document locking for ceph_set_dentry_offset ceph: avoid d_parent in ceph_dentry_hash; fix ceph_encode_fh() hashing bug ceph: protect d_parent access in ceph_d_revalidate ceph: protect access to d_parent ceph: handle racing calls to ceph_init_dentry ceph: set dir complete frag after adding capability rbd: set blk_queue request sizes to object size ceph: set up readahead size when rsize is not passed rbd: cancel watch request when releasing the device ceph: ignore lease mask ceph: fix ceph_lookup_open intent usage ceph: only link open operations to directory unsafe list if O_CREAT\|O_TRUNC ceph: fix bad parent_inode calc in ceph_lookup_open ceph: avoid carrying Fw cap during write into page cache libceph: don't time out osd requests that haven't been received ceph: report f_bfree based on kb_avail rather than diffing. ceph: only queue capsnap if caps are dirty ceph: fix snap writeback when racing with writes ...	2011-07-26 13:38:50 -07:00
Sage Weil	e5f86dc377	ceph: avoid d_parent in ceph_dentry_hash; fix ceph_encode_fh() hashing bug Have caller pass in a safely-obtained reference to the parent directory for calculating a dentry's hash valud. While we're here, simpify the flow through ceph_encode_fh() so that there is a single exit point and cleanup. Also fix a bug with the dentry hash calculation: calculate the hash for the dentry we were given, not its parent. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:30:55 -07:00
Sage Weil	5f21c96dd5	ceph: protect access to d_parent d_parent is protected by d_lock: use it when looking up a dentry's parent directory inode. Also take a reference and drop it in the caller to avoid a use-after-free. Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:30:29 -07:00
Sage Weil	468640e32c	ceph: fix ceph_lookup_open intent usage We weren't properly calling lookup_instantiate_filp when setting up the lookup intent, which could lead to file leakage on errors. So: - use separate helper for the hidden snapdir translation, immediately following the mds request - use ceph_finish_lookup for the final dentry/return value dance in the exit path - lookup_instantiate_filp on success Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:28:11 -07:00
Sage Weil	9cfa1098dc	ceph: use flag bit for at_end readdir flag This saves us a word of memory per file. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:26:18 -07:00
Sage Weil	4918b6d140	ceph: add F_SYNC file flag to force sync (non-O_DIRECT) io This allows us to force IO through the sync path which you normally only get when multiple clients are reading/writing to the same file or by mounting with -o sync. Among other things, this lets test programs verify correctness with a single mount. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:26:07 -07:00
Sage Weil	252c6728de	ceph: add flags field to file_info Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-07-26 11:25:27 -07:00
Josef Bacik	02c24a8218	fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers Btrfs needs to be able to control how filemap_write_and_wait_range() is called in fsync to make it less of a painful operation, so push down taking i_mutex and the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some file systems can drop taking the i_mutex altogether it seems, like ext3 and ocfs2. For correctness sake I just pushed everything down in all cases to make sure that we keep the current behavior the same for everybody, and then each individual fs maintainer can make up their mind about what to do from there. Thanks, Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:59 -04:00
Al Viro	10556cb21a	->permission() sanitizing: don't pass flags to ->permission() not used by the instances anymore. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:24 -04:00
Henry C Chang	d3d0720d4a	ceph: do not use i_wrbuffer_ref as refcount for Fb cap We increments i_wrbuffer_ref when taking the Fb cap. This breaks the dirty page accounting and causes looping in __ceph_do_pending_vmtruncate, and ceph client hangs. This bug can be reproduced occasionally by running blogbench. Add a new field i_wb_ref to inode and dedicate it to Fb reference counting. Signed-off-by: Henry C Chang <henry.cy.chang@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>	2011-05-11 10:44:48 -07:00
Sage Weil	fca65b4ad7	ceph: do not call __mark_dirty_inode under i_lock The __mark_dirty_inode helper now takes i_lock as of `250df6ed`. Fix the one ceph callers that held i_lock (__ceph_mark_dirty_caps) to return the flags value so that the callers can do it outside of i_lock. Signed-off-by: Sage Weil <sage@newdream.net>	2011-05-04 12:56:45 -07:00
Sage Weil	80456f8672	ceph: move readahead default to fs/ceph from libceph Signed-off-by: Sage Weil <sage@newdream.net>	2011-03-21 12:24:23 -07:00
Yehuda Sadeh	ad1fee96cb	ceph: add ino32 mount option The ino32 mount option forces the ceph fs to report 32 bit ino values. This is useful for 64 bit kernels with 32 bit userspace. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>	2011-03-21 12:24:22 -07:00
Sage Weil	9bde178d05	Revert "ceph: keep reference to parent inode on ceph_dentry" This reverts commit `97d79b403e`. This fails to account for d_parent changes due to rename or disconnected dentries due to submounts or NFS reexports. Signed-off-by: Sage Weil <sage@newdream.net>	2011-03-03 10:09:50 -08:00
Linus Torvalds	8bd89ca220	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: keep reference to parent inode on ceph_dentry ceph: queue cap_snaps once per realm libceph: fix socket write error handling libceph: fix socket read error handling	2011-02-21 15:01:38 -08:00
Yehuda Sadeh	97d79b403e	ceph: keep reference to parent inode on ceph_dentry When creating a new dentry we now hold a reference to the parent inode in the ceph_dentry. This is required due to the new RCU changes from `949854d0`, which set dentry->d_parent to NULL in d_kill before calling the ->release() callback. If/when that behavior is changed, we can revert this hack. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2011-02-19 19:59:14 -08:00
Linus Torvalds	a170315420	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: rbd: fix cleanup when trying to mount inexistent image net/ceph: make ceph_msgr_wq non-reentrant ceph: fsc->*_wq's aren't used in memory reclaim path ceph: Always free allocated memory in osdmap_decode() ceph: Makefile: Remove unnessary code ceph: associate requests with opening sessions ceph: drop redundant r_mds field ceph: implement DIRLAYOUTHASH feature to get dir layout from MDS ceph: add dir_layout to inode	2011-01-13 10:25:24 -08:00
Sage Weil	6c0f3af72c	ceph: add dir_layout to inode Add a ceph_dir_layout to the inode, and calculate dentry hash values based on the parent directory's specified dir_hash function. This is needed because the old default Linux dcache hash function is extremely week and leads to a poor distribution of files among dir fragments. Signed-off-by: Sage Weil <sage@newdream.net>	2011-01-12 15:15:12 -08:00
Nick Piggin	b74c79e993	fs: provide rcu-walk aware permission i_ops Signed-off-by: Nick Piggin <npiggin@kernel.dk>	2011-01-07 17:50:29 +11:00
Sage Weil	cd045cb42a	ceph: fix rdcache_gen usage and invalidate We used to use rdcache_gen to indicate whether we "might" have cached pages. Now we just look at the mapping to determine that. However, some old behavior remains from that transition. First, rdcache_gen == 0 no longer means we have no pages. That can happen at any time (presumably when we carry FILE_CACHE). We should not reset it to zero, and we should not check that it is zero. That means that the only purpose for rdcache_revoking is to resolve races between new issues of FILE_CACHE and an async invalidate. If they are equal, we should invalidate. On success, we decrement rdcache_revoking, so that it is no longer equal to rdcache_gen. Similarly, if we success in doing a sync invalidate, set revoking = gen - 1. (This is a small optimization to avoid doing unnecessary invalidate work and does not affect correctness.) Signed-off-by: Sage Weil <sage@newdream.net>	2010-11-08 07:29:05 -08:00
Sage Weil	efa4c1206e	ceph: do not carry i_lock for readdir from dcache We were taking dcache_lock inside of i_lock, which introduces a dependency not found elsewhere in the kernel, complicationg the vfs locking scalability work. Since we don't actually need it here anyway, remove it. We only need i_lock to test for the I_COMPLETE flag, so be careful to do so without dcache_lock held. Signed-off-by: Sage Weil <sage@newdream.net>	2010-10-20 15:38:27 -07:00
Yehuda Sadeh	3d14c5d2b6	ceph: factor out libceph from Ceph file system This factors out protocol and low-level storage parts of ceph into a separate libceph module living in net/ceph and include/linux/ceph. This is mostly a matter of moving files around. However, a few key pieces of the interface change as well: - ceph_client becomes ceph_fs_client and ceph_client, where the latter captures the mon and osd clients, and the fs_client gets the mds client and file system specific pieces. - Mount option parsing and debugfs setup is correspondingly broken into two pieces. - The mon client gets a generic handler callback for otherwise unknown messages (mds map, in this case). - The basic supported/required feature bits can be expanded (and are by ceph_fs_client). No functional change, aside from some subtle error handling cases that got cleaned up in the refactoring process. Signed-off-by: Sage Weil <sage@newdream.net>	2010-10-20 15:37:28 -07:00
Sage Weil	e835124c2b	ceph: only send one flushsnap per cap_snap per mds session Sending multiple flushsnap messages is problematic because we ignore the response if the tid doesn't match, and the server may only respond to each one once. It's also a waste. So, skip cap_snaps that are already on the flushing list, unless the caller tells us to resend (because we are reconnecting). Signed-off-by: Sage Weil <sage@newdream.net>	2010-09-17 08:03:08 -07:00
Sage Weil	ae00d4f37f	ceph: fix cap_snap and realm split The cap_snap creation/queueing relies on both the current i_head_snapc _and_ the i_snap_realm pointers being correct, so that the new cap_snap can properly reference the old context and the new i_head_snapc can be updated to reference the new snaprealm's context. To fix this, we: - move inodes completely to the new (split) realm so that i_snap_realm is correct, and - generate the new snapc's _before_ queueing the cap_snaps in ceph_update_snap_trace(). Signed-off-by: Sage Weil <sage@newdream.net>	2010-09-16 16:26:51 -07:00
Sage Weil	7d8cb26d7d	ceph: maintain i_head_snapc when any caps are dirty, not just for data We used to use i_head_snapc to keep track of which snapc the current epoch of dirty data was dirtied under. It is used by queue_cap_snap to set up the cap_snap. However, since we queue cap snaps for any dirty caps, not just for dirty file data, we need to keep a valid i_head_snapc anytime we have dirty\|flushing caps. This fixes a NULL pointer deref in queue_cap_snap when writing back dirty caps without data (e.g., snaptest-authwb.sh). Signed-off-by: Sage Weil <sage@newdream.net>	2010-08-24 16:24:18 -07:00
Sage Weil	4a625be472	ceph: include dirty xattrs state in snapped caps When we snapshot dirty metadata that needs to be written back to the MDS, include dirty xattr metadata. Make the capsnap reference the encoded xattr blob so that it will be written back in the FLUSHSNAP op. Also fix the capsnap creation guard to include dirty auth or file bits, not just tests specific to dirty file data or file writes in progress (this fixes auth metadata writeback). Signed-off-by: Sage Weil <sage@newdream.net>	2010-08-22 15:16:46 -07:00
Sage Weil	52dfb8ac0e	ceph: constify dentry_operations This makes checkpatch happy. Signed-off-by: Sage Weil <sage@newdream.net>	2010-08-03 10:25:30 -07:00
Greg Farnum	40819f6fb2	ceph: add flock/fcntl lock support Implement flock inode operation to support advisory file locking. All lock/unlock operations are synchronous with the MDS. Lock state is sent when reconnecting to a recovering MDS to restore the shared lock state. Signed-off-by: Greg Farnum <gregf@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2010-08-02 16:10:53 -07:00
Greg Farnum	c6f3fdc592	ceph: add CEPH_FEATURE_FLOCK to the supported feature bits This informs the server that we will accept v2 client_caps format and v2 client_reconnect format messages. Signed-off-by: Greg Farnum <gregf@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2010-08-02 15:48:51 -07:00
Sage Weil	a8b763a9b3	ceph: use %pU to print uuid (fsid) Signed-off-by: Sage Weil <sage@newdream.net>	2010-08-01 20:11:42 -07:00
Sage Weil	6a2593823a	ceph: specify supported features in super.h Specify the supported/required feature bits in super.h client code instead of using the definitions from the shared kernel/userspace headers (which will go away shortly). Signed-off-by: Sage Weil <sage@newdream.net>	2010-08-01 20:11:41 -07:00
Greg Farnum	2bc50259fa	ceph: add ceph_get_cap_for_mds function. Signed-off-by: Greg Farnum <gregf@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2010-08-01 20:11:41 -07:00
Yehuda Sadeh	37151668ba	ceph: do caps accounting per mds_client Caps related accounting is now being done per mds client instead of just being global. This prepares ground work for a later revision of the caps preallocated reservation list. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2010-08-01 20:11:40 -07:00
Linus Torvalds	b612a05537	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: clean up on forwarded aborted mds request ceph: fix leak of osd authorizer ceph: close out mds, osd connections before stopping auth ceph: make lease code DN specific fs/ceph: Use ERR_CAST ceph: renew auth tickets before they expire ceph: do not resend mon requests on auth ticket renewal ceph: removed duplicated #includes ceph: avoid possible null dereference ceph: make mds requests killable, not interruptible sched: add wait_for_completion_killable_timeout	2010-05-30 08:56:39 -07:00
Andrea Gelmini	984c76908e	ceph: removed duplicated #includes fs/ceph/auth.c: linux/slab.h is included more than once. fs/ceph/super.h: linux/slab.h is included more than once. Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-29 09:12:37 -07:00
Christoph Hellwig	7ea8085910	drop unused dentry argument to ->fsync Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-27 22:05:02 -04:00
Sage Weil	23804d91f1	ceph: specify max_bytes on readdir replies Specify max bytes in request to bound size of reply. Add associated mount option with default value of 512 KB. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-17 15:25:41 -07:00
Sage Weil	6e19a16ef2	ceph: clean up mount options, ->show_options() Ensure all options are included in /proc/mounts. Some cleanup. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-17 15:25:29 -07:00
Cheng Renquan	640ef79d27	ceph: use ceph_sb_to_client instead of ceph_client ceph_sb_to_client and ceph_client are really identical, we need to dump one; while function ceph_client is confusing with "struct ceph_client", ceph_sb_to_client's definition is more clear; so we'd better switch all call to ceph_sb_to_client. -static inline struct ceph_client ceph_client(struct super_block sb) -{ - return sb->s_fs_info; -} Signed-off-by: Cheng Renquan <crquan@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-17 15:25:17 -07:00
Sage Weil	81a6cf2d30	ceph: invalidate affected dentry leases on aborted requests If we abort a request, we return to caller, but the request may still complete. And if we hold the dir FILE_EXCL bit, we may not release a lease when sending a request. A simple un-tar, control-c, un-tar again will reproduce the bug (manifested as a 'Cannot open: File exists'). Ensure we invalidate affected dentry leases (as well dir I_COMPLETE) so we don't have valid (but incorrect) leases. Do the same, consistently, at other sites where I_COMPLETE is similarly cleared. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-17 10:25:45 -07:00
Sage Weil	d45d0d970f	ceph: add missing #includes Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-03 10:49:24 -07:00
Linus Torvalds	96e35b40c0	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: use separate class for ceph sockets' sk_lock ceph: reserve one more caps space when doing readdir ceph: queue_cap_snap should always queue dirty context ceph: fix dentry reference leak in dcache readdir ceph: decode v5 of osdmap (pool names) [protocol change] ceph: fix ack counter reset on connection reset ceph: fix leaked inode ref due to snap metadata writeback race ceph: fix snap context reference leaks ceph: allow writeback of snapped pages older than 'oldest' snapc ceph: fix dentry rehashing on virtual .snap dir	2010-04-14 18:45:31 -07:00
Sage Weil	fc837c8f04	ceph: queue_cap_snap should always queue dirty context This simplifies the calling convention, and fixes a bug where we queue a capsnap with a context other than i_head_snapc (the one that matches the dirty pages). The result was a BUG at fs/ceph/caps.c:2178 on writeback completion when a capsnap matching the writeback snapc could not be found. Signed-off-by: Sage Weil <sage@newdream.net>	2010-04-13 12:28:31 -07:00
Tejun Heo	5a0e3ad6af	include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>	2010-03-30 22:02:32 +09:00
Stephen Rothwell	f1a3d57213	ceph: update for write_inode API change Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Sage Weil <sage@newdream.net>	2010-03-05 14:49:41 -08:00
Yehuda Sadeh	422d2cb8f9	ceph: reset osd after relevant messages timed out This simplifies the process of timing out messages. We keep lru of current messages that are in flight. If a timeout has passed, we reset the osd connection, so that messages will be retransmitted. This is a failsafe in case we hit some sort of problem sending out message to the OSD. Normally, we'll get notification via an updated osdmap if there are problems. If a request is older than the keepalive timeout, send a keepalive to ensure we detect any breaks in the TCP connection. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2010-03-04 11:26:35 -08:00
Sage Weil	e9964c1023	ceph: fix flush_dirty_caps race with caps migration The flush_dirty_caps() used to loop over the first entry of the cap_dirty dirty list on the assumption that after calling ceph_check_caps() it would be removed from the list. This isn't true for caps that are being migrated between MDSs, where we've received the EXPORT but not the IMPORT. Instead, do a safe list iteration, and pin the next inode on the list via the CEPH_I_NOFLUSH flag. Signed-off-by: Sage Weil <sage@newdream.net>	2010-03-01 15:28:02 -08:00
Sage Weil	2600d2dd50	ceph: drop messages on unregistered mds sessions; cleanup Verify the mds session is currently registered before handling incoming messages. Clean up message handlers to pull mds out of session->s_mds instead of less trustworthy src field. Clean up con_{get,put} debug output. Signed-off-by: Sage Weil <sage@newdream.net>	2010-02-23 14:26:35 -08:00
Sage Weil	7c1332b8cb	ceph: fix iterate_caps removal race We need to be able to iterate over all caps on a session with a possibly slow callback on each cap. To allow this, we used to prevent cap reordering while we were iterating. However, we were not safe from races with removal: removing the 'next' cap would make the next pointer from list_for_each_entry_safe be invalid, and cause a lock up or similar badness. Instead, we keep an iterator pointer in the session pointing to the current cap. As before, we avoid reordering. For removal, if the cap isn't the current cap we are iterating over, we are fine. If it is, we clear cap->ci (to mark the cap as pending removal) but leave it in the session list. In iterate_caps, we can safely finish removal and get the next cap pointer. While we're at it, clean up put_cap to not take a cap reservation context, as it was never used. Signed-off-by: Sage Weil <sage@newdream.net>	2010-02-17 10:02:47 -08:00
Sage Weil	85ccce43a3	ceph: clean up readdir caps reservation Use a global counter for the minimum number of allocated caps instead of hard coding a check against readdir_max. This takes into account multiple client instances, and avoids examining the superblock mount options when a cap is dropped. Signed-off-by: Sage Weil <sage@newdream.net>	2010-02-17 10:02:43 -08:00
Sage Weil	a105f00cf1	ceph: use rbtree for snap_realms Switch from radix tree to rbtree for snap realms. This is much more appropriate given that realm keys are few and far between. Signed-off-by: Sage Weil <sage@newdream.net>	2010-02-16 22:01:09 -08:00
Sage Weil	3c6f6b79a6	ceph: cleanup async writeback, truncation, invalidate helpers Grab inode ref in helper. Make work functions static, with consistent naming. Signed-off-by: Sage Weil <sage@newdream.net>	2010-02-11 11:48:54 -08:00
Yehuda Sadeh	f5a2041bd9	ceph: put unused osd connections on lru Instead of removing osd connection immediately when the requests list is empty, put the osd connection on an lru. Only if that osd has not been used for more than a specified time, will it be removed. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2010-02-11 11:48:48 -08:00
Sage Weil	9bd2e6f8ba	ceph: allow renewal of auth credentials Add infrastructure to allow the mon_client to periodically renew its auth credentials. Also add a messenger callback that will force such a renewal if a peer rejects our authenticator. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2010-02-10 15:04:47 -08:00
Yehuda Sadeh	2baba25019	ceph: writeback congestion control Set bdi congestion bit when amount of write data in flight exceeds adjustable threshold. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2009-12-21 16:39:56 -08:00
Sage Weil	06edf046dd	ceph: include link to bdi in debugfs Signed-off-by: Sage Weil <sage@newdream.net>	2009-12-21 16:39:54 -08:00
Sage Weil	0743304d87	ceph: fix debugfs entry, simplify fsid checks We may first learn our fsid from any of the mon, osd, or mds maps (whichever the monitor sends first). Consolidate checks in a single helper. Initialize the client debugfs entry then, since we need the fsid (and global_id) for the directory name. Also remove dead mount code. Signed-off-by: Sage Weil <sage@newdream.net>	2009-11-20 14:24:27 -08:00
Sage Weil	4e7a5dcd1b	ceph: negotiate authentication protocol; implement AUTH_NONE protocol When we open a monitor session, we send an initial AUTH message listing the auth protocols we support, our entity name, and (possibly) a previously assigned global_id. The monitor chooses a protocol and responds with an initial message. Initially implement AUTH_NONE, a dummy protocol that provides no security, but works within the new framework. It generates 'authorizers' that are used when connecting to (mds, osd) services that simply state our entity name and global_id. This is a wire protocol change. Signed-off-by: Sage Weil <sage@newdream.net>	2009-11-18 16:19:57 -08:00
Sage Weil	039934b895	ceph: build cleanly without CONFIG_DEBUG_FS Signed-off-by: Sage Weil <sage@newdream.net>	2009-11-12 15:56:51 -08:00
Sage Weil	cdac830313	ceph: remove recon_gen logic We don't get an explicit affirmative confirmation that our caps reconnect, nor do we necessarily want to pay that cost. So, take all this code out for now. Signed-off-by: Sage Weil <sage@newdream.net>	2009-11-10 16:03:53 -08:00
Sage Weil	685f9a5d14	ceph: do not confuse stale and dead (unreconnected) caps We were using the cap_gen to track both stale caps (caps that timed out due to temporarily losing touch with the mds) and dead caps that did not reconnect after an MDS failure. Introduce a recon_gen counter to track reconnections to restarted MDSs and kill dead caps based on that instead. Rename gen to cap_gen while we're at it to make it more clear which is which. Signed-off-by: Sage Weil <sage@newdream.net>	2009-11-09 12:06:07 -08:00
Noah Watkins	fbbccec9c6	ceph: replace list_entry with container_of Usage of non-list.h list_entry function for container_of functionality replaced with direct use of container_of. Signed-off-by: Noah Watkins <noah@noahdesu.com> Signed-off-by: Sage Weil <sage@newdream.net>	2009-10-28 17:44:22 -07:00
Sage Weil	6b8051855d	ceph: allocate and parse mount args before client instance This simplifies much of the error handling during mount. It also means that we have the mount args before client creation, and we can initialize based on those options. Signed-off-by: Sage Weil <sage@newdream.net>	2009-10-27 11:57:03 -07:00
Sage Weil	ecb19c4649	ceph: remove small mon addr limit; use CEPH_MAX_MON where appropriate Get rid of separate max mon limit; use the system limit instead. This allows mounts when there are lots of mon addrs provided by mount.ceph (as with a host with lots of A/AAAA records). Signed-off-by: Sage Weil <sage@newdream.net>	2009-10-22 10:53:17 -07:00
Sage Weil	8fa9765576	ceph: enable readahead Initialized bdi->ra_pages to enable readahead. Use 512KB default. Signed-off-by: Sage Weil <sage@newdream.net>	2009-10-16 14:44:43 -07:00
Sage Weil	afcdaea3f2	ceph: flush dirty caps via the cap_dirty list Previously we were flushing dirty caps by passing an extra flag when traversing the delayed caps list. Besides being a bit ugly, that can also miss caps that are dirty but didn't result in a cap requeue: notably, mark_caps_dirty(). Separate the flushing into a separate helper, and traverse the cap_dirty list. This also brings i_dirty_item in line with i_dirty_caps: we are on the list IFF caps != 0. We carry an inode ref IFF dirty_caps\|flushing_caps != 0. Lose the unused return value from __ceph_mark_caps_dirty(). Signed-off-by: Sage Weil <sage@newdream.net>	2009-10-15 18:14:35 -07:00
Sage Weil	de57606c23	ceph: client types We first define constants, types, and prototypes for the kernel client proper. A few subsystems are defined separately later: the MDS, OSD, and monitor clients, and the messaging layer. Signed-off-by: Sage Weil <sage@newdream.net>	2009-10-06 11:31:07 -07:00

... 2 3 4 5 6

286 Commits