iv/linux

Author	SHA1	Message	Date
Sage Weil	9dd4658db1	ceph: close messenger race Simplify messenger locking, and close race between ceph_con_close() setting the CLOSED bit and con_work() checking the bit, then taking the mutex. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-17 15:25:25 -07:00
Sage Weil	6f2bc3ff4c	ceph: clean up connection reset Reset out_keepalive_pending and peer_global_seq, and drop unused var. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-17 15:25:20 -07:00
Sage Weil	bb257664f7	ceph: simplify ceph_msg_new We only need to pass in front_len. Callers can attach any other payload pieces (middle, data) as they see fit. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-17 15:25:19 -07:00
Sage Weil	a79832f26b	ceph: make ceph_msg_new return NULL on failure; clean up, fix callers Returning ERR_PTR(-ENOMEM) is useless extra work. Return NULL on failure instead, and fix up the callers (about half of which were wrong anyway). Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-17 15:25:18 -07:00
Yehuda Sadeh	31459fe4b2	ceph: use __page_cache_alloc and add_to_page_cache_lru Following Nick Piggin patches in btrfs, pagecache pages should be allocated with __page_cache_alloc, so they obey pagecache memory policies. Also, using add_to_page_cache_lru instead of using a private pagevec where applicable. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-17 15:25:12 -07:00
Sage Weil	e84346b726	ceph: preserve seq # on requeued messages after transient transport errors If the tcp connection drops and we reconnect to reestablish a stateful session (with the mds), we need to resend previously sent (and possibly received) messages with the _same_ seq # so that they can be dropped on the other end if needed. Only assign a new seq once after the message is queued. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-11 21:20:38 -07:00
Sage Weil	45c6ceb547	ceph: zero unused message header, footer fields We shouldn't leak any prior memory contents to other parties. And random data, particularly in the 'version' field, can cause problems down the line. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-11 15:17:40 -07:00
Sage Weil	ae18756b9f	ceph: discard incoming messages with bad seq # We can get old message seq #'s after a tcp reconnect for stateful sessions (i.e., the MDS). If we get a higher seq #, that is an error, and we shouldn't see any bad seq #'s for stateless (mon, osd) connections. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-03 10:49:24 -07:00
Sage Weil	684be25c52	ceph: fix seq counting for skipped messages Increment in_seq even when the message is skipped for some reason. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-03 10:49:24 -07:00
Linus Torvalds	96e35b40c0	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: use separate class for ceph sockets' sk_lock ceph: reserve one more caps space when doing readdir ceph: queue_cap_snap should always queue dirty context ceph: fix dentry reference leak in dcache readdir ceph: decode v5 of osdmap (pool names) [protocol change] ceph: fix ack counter reset on connection reset ceph: fix leaked inode ref due to snap metadata writeback race ceph: fix snap context reference leaks ceph: allow writeback of snapped pages older than 'oldest' snapc ceph: fix dentry rehashing on virtual .snap dir	2010-04-14 18:45:31 -07:00
Sage Weil	a6a5349d17	ceph: use separate class for ceph sockets' sk_lock Use a separate class for ceph sockets to prevent lockdep confusion. Because ceph sockets only get passed kernel pointers, there is no dependency from sk_lock -> mmap_sem. If we share the same class as other sockets, lockdep detects a circular dependency from mmap_sem (page fault) -> fs mutex -> sk_lock -> mmap_sem because dependencies are noted from both ceph and user contexts. Using a separate class prevents the sk_lock(ceph) -> mmap_sem dependency and makes lockdep happy. Signed-off-by: Sage Weil <sage@newdream.net>	2010-04-13 14:07:07 -07:00
Sage Weil	0e0d5e0c4b	ceph: fix ack counter reset on connection reset If in_seq_acked isn't reset along with in_seq, we don't ack received messages until we reach the old count, consuming gobs memory on the other end of the connection and introducing a large delay when those messages are eventually deleted. Signed-off-by: Sage Weil <sage@newdream.net>	2010-04-02 16:07:19 -07:00
Tejun Heo	5a0e3ad6af	include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>	2010-03-30 22:02:32 +09:00
Sage Weil	87b315a5b5	ceph: avoid reopening osd connections when address hasn't changed We get a fault callback on _every_ tcp connection fault. Normally, we want to reopen the connection when that happens. If the address we have is bad, however, and connection attempts always result in a connection refused or similar error, explicitly closing and reopening the msgr connection just prevents the messenger's backoff logic from kicking in. The result can be a console full of [ 3974.417106] ceph: osd11 10.3.14.138:6800 connection failed [ 3974.423295] ceph: osd11 10.3.14.138:6800 connection failed [ 3974.429709] ceph: osd11 10.3.14.138:6800 connection failed Instead, if we get a fault, and have outstanding requests, but the osd address hasn't changed and the connection never successfully connected in the first place, do nothing to the osd connection. The messenger layer will back off and retry periodically, because we never connected and thus the lossy bit is not set. Instead, touch each request's r_stamp so that handle_timeout can tell the request is still alive and kicking. Signed-off-by: Sage Weil <sage@newdream.net>	2010-03-23 07:47:01 -07:00
Sage Weil	3c3f2e32ef	ceph: fix connection fault con_work reentrancy problem The messenger fault was clearing the BUSY bit, for reasons unclear. This made it possible for the con->ops->fault function to reopen the connection, and requeue work in the workqueue--even though the current thread was already in con_work. This avoids a problem where the client busy loops with connection failures on an unreachable OSD, but doesn't address the root cause of that problem. Signed-off-by: Sage Weil <sage@newdream.net>	2010-03-23 07:46:59 -07:00
Sage Weil	63733a0fc5	ceph: fix authenticator timeout We were failing to reconnect to services due to an old authenticator, even though we had the new ticket, because we weren't properly retrying the connect handshake, because we were calling an old/incorrect helper that left in_base_pos incorrect. The result was a failure to reconnect to the OSD or MDS (with an authentication error) if the MDS restarted after the service had been up a few hours (long enough for the original authenticator to be invalid). This was only a problem if the AUTH_X authentication was enabled. Now that the 'negotiate' and 'connect' stages are fully separated, we should use the prepare_read_connect() helper instead, and remove the obsolete one. Signed-off-by: Sage Weil <sage@newdream.net>	2010-03-20 21:33:09 -07:00
Sage Weil	3ca02ef96e	ceph: reset front len on return to msgpool; BUG on mismatched front iov Reset msg front len when a message is returned to the pool: the caller may have changed it. BUG if we try to send a message with a hdr.front_len that doesn't match the front iov. Signed-off-by: Sage Weil <sage@newdream.net>	2010-03-01 15:25:00 -08:00
Sage Weil	1679f876a6	ceph: reset bits on connection close Clear LOSSYTX bit, so that if/when we reconnect, said reconnect will retry on failure. Clear _PENDING bits too, to avoid polluting subsequent connection state. Drop unused REGISTERED bit. Signed-off-by: Sage Weil <sage@newdream.net>	2010-03-01 15:19:51 -08:00
Sage Weil	e80a52d14f	ceph: fix connection fault STANDBY check Move any out_sent messages to out_queue _before_ checking if out_queue is empty and going to STANDBY, or else we may drop something that was never acked. And clean up the code a bit (less goto). Signed-off-by: Sage Weil <sage@newdream.net>	2010-02-25 12:40:45 -08:00
Sage Weil	161fd65ac9	ceph: invalidate_authorizer without con->mutex held This fixes lock ABBA inversion, as the ->invalidate_authorizer() op may need to take a lock (or even call back into the messenger). Signed-off-by: Sage Weil <sage@newdream.net>	2010-02-25 12:38:57 -08:00
Sage Weil	5b3a4db3e4	ceph: fix up unexpected message handling Fix skipping of unexpected message types from osd, mon. Clean up pr_info and debug output. Signed-off-by: Sage Weil <sage@newdream.net>	2010-02-23 14:26:29 -08:00
Sage Weil	91e45ce389	ceph: cancel delayed work when closing connection This ensures that if/when we reopen the connection, we can requeue work on the connection immediately, without waiting for an old timer to expire. Queue new delayed work inside con->mutex to avoid any race. This fixes problems with clients failing to reconnect to the MDS due to the client_reconnect message arriving too late (due to waiting for an old delayed work timeout to expire). Signed-off-by: Sage Weil <sage@newdream.net>	2010-02-16 22:01:07 -08:00
Sage Weil	e2663ab60d	ceph: allow connection to be reopened by fault callback Fix the messenger to allow a ceph_con_open() during the fault callback. Previously the work wasn't getting queued on the connection because the fault path avoids requeued work (normally spurious). Loop on reopening by checking for the OPENING state bit. This fixes OSD reconnects when a TCP connection drops. Signed-off-by: Sage Weil <sage@newdream.net>	2010-02-16 22:01:03 -08:00
Sage Weil	6c5d1a49e5	ceph: fix msgr to keep sent messages until acked The test was backwards from commit `b3d1dbbd`: keep the message if the connection _isn't_ lossy. This allows the client to continue when the TCP connection drops for some reason (network glitch) but both ends survive. Signed-off-by: Sage Weil <sage@newdream.net>	2010-02-13 20:29:31 -08:00
Sage Weil	9bd2e6f8ba	ceph: allow renewal of auth credentials Add infrastructure to allow the mon_client to periodically renew its auth credentials. Also add a messenger callback that will force such a renewal if a peer rejects our authenticator. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2010-02-10 15:04:47 -08:00
Sage Weil	ac8839d7b2	ceph: include type in ceph_entity_addr, filepath Include a type/version in ceph_entity_addr and filepath. Include extra byte in filepath encoding as necessary. Signed-off-by: Sage Weil <sage@newdream.net>	2010-01-29 12:41:09 -08:00
Yehuda Sadeh	0d59ab81c3	ceph: keep reserved replies on the request structure This includes treating all the data preallocation and revokation at the same place, not having to have a special case for the reserved pages. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>	2010-01-25 12:58:08 -08:00
Yehuda Sadeh	0547a9b30a	ceph: alloc message data pages and check if tid exists Now doing it in the same callback that is also responsible for allocating the 'front' part of the message. If we get a message that we haven't got a corresponding tid for, mark it for skipping. Moving the mutex unlock/lock from the osd alloc_msg callback to the calling function in the messenger. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>	2010-01-25 12:57:46 -08:00
Yehuda Sadeh	9d7f0f139e	ceph: refactor messages data section allocation Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>	2010-01-25 12:57:43 -08:00
Yehuda Sadeh	2450418c47	ceph: allocate middle of message before stating to read Both front and middle parts of the message are now being allocated at the ceph_alloc_msg(). Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>	2010-01-25 12:57:37 -08:00
Sage Weil	103e2d3ae5	ceph: remove unused erank field The ceph_entity_addr erank field is obsolete; remove it. Get rid of trivial addr comparison helpers while we're at it. Signed-off-by: Sage Weil <sage@newdream.net>	2010-01-14 12:23:38 -08:00
Sage Weil	58bb3b374b	ceph: support ceph_pagelist for message payload The ceph_pagelist is a simple list of whole pages, strung together via their lru list_head. It facilitates encoding to a "buffer" of unknown size. Allow its use in place of the ceph_msg page vector. This will be used to fix the huge buffer preallocation woes of MDS reconnection. Signed-off-by: Sage Weil <sage@newdream.net>	2009-12-23 12:12:31 -08:00
Sage Weil	04a419f908	ceph: add feature bits to connection handshake (protocol change) Define supported and required feature set. Fail connection if the server requires features we do not support (TAG_FEATURES), or if the server does not support features we require. Signed-off-by: Sage Weil <sage@newdream.net>	2009-12-23 09:30:21 -08:00
Sage Weil	350b1c32ea	ceph: control access to page vector for incoming data When we issue an OSD read, we specify a vector of pages that the data is to be read into. The request may be sent multiple times, to multiple OSDs, if the osdmap changes, which means we can get more than one reply. Only read data into the page vector if the reply is coming from the OSD we last sent the request to. Keep track of which connection is using the vector by taking a reference. If another connection was already using the vector before and a new reply comes in on the right connection, revoke the pages from the other connection. Signed-off-by: Sage Weil <sage@newdream.net>	2009-12-23 08:17:20 -08:00
Sage Weil	ec302645f4	ceph: use connection mutex to protect read and write stages Use a single mutex (previously out_mutex) to protect both read and write activity from concurrent ceph_con_* calls. Drop the mutex when doing callbacks to avoid nested locking (the callback may need to call something like ceph_con_close). Signed-off-by: Sage Weil <sage@newdream.net>	2009-12-23 08:17:19 -08:00
Yehuda Sadeh	169e16ce81	ceph: remove unaccessible code Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>	2009-12-21 16:39:55 -08:00
Sage Weil	cf3e5c409b	ceph: plug leak of incoming message during connection fault/close If we explicitly close a connection, or there is a socket error, we need to drop any partially received message. Signed-off-by: Sage Weil <sage@newdream.net>	2009-12-21 16:39:53 -08:00
Sage Weil	9ec7cab14e	ceph: hex dump corrupt server data to KERN_DEBUG Also, print fsid using standard format, NOT hex dump. Signed-off-by: Sage Weil <sage@newdream.net>	2009-12-21 16:39:52 -08:00
Sage Weil	b3d1dbbdd5	ceph: don't save sent messages on lossy connections For lossy connections we drop all state on socket errors, so there is no reason to keep sent ceph_msg's around. Signed-off-by: Sage Weil <sage@newdream.net>	2009-12-21 16:39:50 -08:00
Sage Weil	92ac41d0a4	ceph: detect lossy state of connection The server indicates whether a connection is lossy; set our LOSSYTX bit appropriately. Do not set lossy bit on outgoing connections. Signed-off-by: Sage Weil <sage@newdream.net>	2009-12-21 16:39:49 -08:00
Sage Weil	5e095e8b40	ceph: plug msg leak in con_fault Signed-off-by: Sage Weil <sage@newdream.net>	2009-12-21 16:39:49 -08:00
Sage Weil	c86a2930cc	ceph: carry explicit msg reference for currently sending message Carry a ceph_msg reference for connection->out_msg. This will allow us to make out_sent optional. Signed-off-by: Sage Weil <sage@newdream.net>	2009-12-21 16:39:38 -08:00
Sage Weil	c2e552e76e	ceph: use kref for ceph_msg Signed-off-by: Sage Weil <sage@newdream.net>	2009-12-07 15:55:05 -08:00
Sage Weil	b6c1d5b81e	ceph: simplify ceph_buffer interface We never allocate the ceph_buffer and buffer separtely, so use a single constructor. Disallow put on NULL buffer; make the caller check. Signed-off-by: Sage Weil <sage@newdream.net>	2009-12-07 12:17:17 -08:00
Sage Weil	03c677e1d1	ceph: reset msgr backoff during open, not after successful handshake Reset the backoff delay when we reopen the connection, so that the delays for any initial connection problems are reasonable. We were resetting only after a successful handshake, which was of limited utility. Signed-off-by: Sage Weil <sage@newdream.net>	2009-11-20 15:14:15 -08:00
Sage Weil	4e7a5dcd1b	ceph: negotiate authentication protocol; implement AUTH_NONE protocol When we open a monitor session, we send an initial AUTH message listing the auth protocols we support, our entity name, and (possibly) a previously assigned global_id. The monitor chooses a protocol and responds with an initial message. Initially implement AUTH_NONE, a dummy protocol that provides no security, but works within the new framework. It generates 'authorizers' that are used when connecting to (mds, osd) services that simply state our entity name and global_id. This is a wire protocol change. Signed-off-by: Sage Weil <sage@newdream.net>	2009-11-18 16:19:57 -08:00
Sage Weil	71ececdaca	ceph: remove unnecessary ceph_con_shutdown We require that ceph_con_close be called before we drop the connection, so this is unneeded. Just BUG if con->sock != NULL. Signed-off-by: Sage Weil <sage@newdream.net>	2009-11-18 11:29:45 -08:00
Sage Weil	eed0ef2caf	ceph: separate banner and connect during handshake into distinct stages We need to make sure we only swab the address during the banner once. So break process_banner out of process_connect, and clean up the surrounding code so that these are distinct phases of the handshake. Signed-off-by: Sage Weil <sage@newdream.net>	2009-11-10 14:34:48 -08:00
Sage Weil	f28bcfbe66	ceph: convert port endianness The port is informational only, but we should make it correct. Signed-off-by: Sage Weil <sage@newdream.net>	2009-11-04 16:36:12 -08:00
Sage Weil	63f2d21195	ceph: use fixed endian encoding for ceph_entity_addr We exchange struct ceph_entity_addr over the wire and store it on disk. The sockaddr_storage.ss_family field, however, is host endianness. So, fix ss_family endianness to big endian when sending/receiving over the wire. Signed-off-by: Sage Weil <sage@newdream.net>	2009-11-03 15:17:56 -08:00
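
Several of the commits above (ae18756b9f, 684be25c52, e84346b726) describe how the messenger treats incoming sequence numbers after a TCP reconnect. The following is a minimal, standalone sketch of that rule as the commit messages state it, not the kernel code itself. The names `conn_state`, `seq_verdict`, `check_seq`, and `receive_seq` are hypothetical and chosen for illustration. Only `in_seq` appears in the log, and the "exactly one past in_seq" test is an assumption about what "a higher seq #" means.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical verdicts for an incoming message's sequence number. */
enum seq_verdict {
    SEQ_ACCEPT,    /* next expected message */
    SEQ_DROP_OLD,  /* resent after a reconnect; already seen, discard */
    SEQ_ERROR_GAP  /* higher than expected; treated as an error */
};

/* Hypothetical per-connection state; only in_seq is named in the log. */
struct conn_state {
    uint64_t in_seq; /* seq # of the last message counted as received */
};

/* Classify an incoming seq # against the connection's in_seq. */
static enum seq_verdict check_seq(const struct conn_state *c, uint64_t seq)
{
    if (seq <= c->in_seq)
        return SEQ_DROP_OLD;   /* old seq # replayed by the peer */
    if (seq > c->in_seq + 1)
        return SEQ_ERROR_GAP;  /* assumption: any gap is an error */
    return SEQ_ACCEPT;
}

/*
 * Account for one incoming message. skip_payload models a message whose
 * body is skipped for some other reason (for example, no matching
 * request); in_seq still advances so later messages are not misjudged.
 */
static enum seq_verdict receive_seq(struct conn_state *c, uint64_t seq,
                                    bool skip_payload)
{
    enum seq_verdict v = check_seq(c, seq);

    (void)skip_payload; /* skipping the body does not change seq counting */
    if (v == SEQ_ACCEPT)
        c->in_seq++;
    return v;
}
```

Because a requeued message keeps its original seq # (e84346b726), a receiver applying this kind of check can discard the duplicate instead of processing it twice.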

52 Commits