Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (205 commits)
  ceph: update for write_inode API change
  ceph: reset osd after relevant messages timed out
  ceph: fix flush_dirty_caps race with caps migration
  ceph: include migrating caps in issued set
  ceph: fix osdmap decoding when pools include (removed) snaps
  ceph: return EBADF if waiting for caps on closed file
  ceph: set osd request message front length correctly
  ceph: reset front len on return to msgpool; BUG on mismatched front iov
  ceph: fix snaptrace decoding on cap migration between mds
  ceph: use single osd op reply msg
  ceph: reset bits on connection close
  ceph: remove bogus mds forward warning
  ceph: remove fragile __map_osds optimization
  ceph: fix connection fault STANDBY check
  ceph: invalidate_authorizer without con->mutex held
  ceph: don't clobber write return value when using O_SYNC
  ceph: fix client_request_forward decoding
  ceph: drop messages on unregistered mds sessions; cleanup
  ceph: fix comments, locking in destroy_inode
  ceph: move dereference after NULL test
  ...

Fix trivial conflicts in Documentation/ioctl/ioctl-number.txt
commit
fc7f99cf36
139
Documentation/filesystems/ceph.txt
Normal file
@@ -0,0 +1,139 @@
Ceph Distributed File System
============================

Ceph is a distributed network file system designed to provide good
performance, reliability, and scalability.

Basic features include:

 * POSIX semantics
 * Seamless scaling from 1 to many thousands of nodes
 * High availability and reliability. No single points of failure.
 * N-way replication of data across storage nodes
 * Fast recovery from node failures
 * Automatic rebalancing of data on node addition/removal
 * Easy deployment: most FS components are userspace daemons

Also,
 * Flexible snapshots (on any directory)
 * Recursive accounting (nested files, directories, bytes)

In contrast to cluster filesystems like GFS, OCFS2, and GPFS that rely
on symmetric access by all clients to shared block devices, Ceph
separates data and metadata management into independent server
clusters, similar to Lustre. Unlike Lustre, however, metadata and
storage nodes run entirely as user space daemons. Storage nodes
utilize btrfs to store data objects, leveraging its advanced features
(checksumming, metadata replication, etc.). File data is striped
across storage nodes in large chunks to distribute workload and
facilitate high throughputs. When storage nodes fail, data is
re-replicated in a distributed fashion by the storage nodes themselves
(with some minimal coordination from a cluster monitor), making the
system extremely efficient and scalable.

Metadata servers effectively form a large, consistent, distributed
in-memory cache above the file namespace that is extremely scalable,
dynamically redistributes metadata in response to workload changes,
and can tolerate arbitrary (well, non-Byzantine) node failures. The
metadata server takes a somewhat unconventional approach to metadata
storage to significantly improve performance for common workloads. In
particular, inodes with only a single link are embedded in
directories, allowing entire directories of dentries and inodes to be
loaded into its cache with a single I/O operation. The contents of
extremely large directories can be fragmented and managed by
independent metadata servers, allowing scalable concurrent access.

The system offers automatic data rebalancing/migration when scaling
from a small cluster of just a few nodes to many hundreds, without
requiring an administrator to carve the data set into static volumes
or go through the tedious process of migrating data between servers.
When the file system approaches full, new nodes can be easily added
and things will "just work."

Ceph includes a flexible snapshot mechanism that allows a user to
create a snapshot on any subdirectory (and its nested contents) in the
system. Snapshot creation and deletion are as simple as 'mkdir
.snap/foo' and 'rmdir .snap/foo'.

Ceph also provides some recursive accounting on directories for nested
files and bytes. That is, a 'getfattr -d foo' on any directory in the
system will reveal the total number of nested regular files and
subdirectories, and a summation of all nested file sizes. This makes
the identification of large disk space consumers relatively quick, as
no 'du' or similar recursive scan of the file system is required.


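The recursive accounting values described above can also be read from a program rather than through 'getfattr'. The following is a minimal userspace sketch, not part of this commit: it assumes the counters are exposed as extended attributes whose values are decimal ASCII strings, and it takes the attribute name from the caller because the text above does not name the attributes. The helper names parse_counter() and read_dir_counter() are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* Parse a decimal counter of 'len' bytes (not necessarily NUL-terminated). */
static int parse_counter(const char *val, size_t len, unsigned long long *out)
{
	char tmp[32];
	char *endp;

	assert(val && out);
	if (len == 0 || len >= sizeof(tmp))
		return -EINVAL;
	memcpy(tmp, val, len);
	tmp[len] = '\0';
	if (tmp[0] < '0' || tmp[0] > '9')	/* no sign, no leading space */
		return -EINVAL;
	errno = 0;
	*out = strtoull(tmp, &endp, 10);
	if (errno || *endp != '\0')
		return -EINVAL;
	return 0;
}

/*
 * Read one recursive-accounting counter from a directory.
 * 'attr' is supplied by the caller: the attribute names are an
 * assumption here and should be taken from 'getfattr -d' output
 * on a real mount.
 */
static int read_dir_counter(const char *dir, const char *attr,
			    unsigned long long *out)
{
	char val[32];
	ssize_t n = getxattr(dir, attr, val, sizeof(val));

	if (n < 0)
		return -errno;
	return parse_counter(val, (size_t)n, out);
}
```

A caller would pass the directory and whichever attribute name 'getfattr -d' reports for the nested byte count, then compare the returned totals across sibling directories to find the large consumers.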
Mount Syntax
============

The basic mount syntax is:

 # mount -t ceph monip[:port][,monip2[:port]...]:/[subdir] mnt

You only need to specify a single monitor, as the client will get the
full list when it connects. (However, if the monitor you specify
happens to be down, the mount won't succeed.) The port can be left
off if the monitor is using the default. So if the monitor is at
1.2.3.4,

 # mount -t ceph 1.2.3.4:/ /mnt/ceph

is sufficient. If /sbin/mount.ceph is installed, a hostname can be
used instead of an IP address.



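As an illustration of the device string in the mount syntax above, here is a small sketch that assembles "monip[:port][,monip2[:port]...]:/[subdir]" from a list of monitor addresses. It is not code from this commit; the function name build_ceph_source() is hypothetical, and the caller is assumed to pass addresses that already include any ":port" suffix.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "mon1[:port],mon2[:port]:/subdir" into out.
 * Returns the string length, or -1 if it does not fit or nmons == 0.
 */
static int build_ceph_source(char *out, size_t outlen,
			     const char *const *mons, size_t nmons,
			     const char *subdir)
{
	size_t used = 0;
	size_t i;
	int n;

	assert(out && mons);
	if (nmons == 0)
		return -1;
	for (i = 0; i < nmons; i++) {
		/* comma-separate every monitor after the first */
		n = snprintf(out + used, outlen - used, "%s%s",
			     i ? "," : "", mons[i]);
		if (n < 0 || (size_t)n >= outlen - used)
			return -1;
		used += (size_t)n;
	}
	/* the path part always starts with ":/"; subdir is optional */
	n = snprintf(out + used, outlen - used, ":/%s", subdir ? subdir : "");
	if (n < 0 || (size_t)n >= outlen - used)
		return -1;
	return (int)(used + (size_t)n);
}
```

The resulting string is what would be given as the device argument to 'mount -t ceph'; a single monitor with no subdirectory yields the short "1.2.3.4:/" form shown above.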
Mount Options
=============

  ip=A.B.C.D[:N]
	Specify the IP and/or port the client should bind to locally.
	There is normally not much reason to do this. If the IP is not
	specified, the client's IP address is determined by looking at the
	address its connection to the monitor originates from.

  wsize=X
	Specify the maximum write size in bytes. By default there is no
	maximum. Ceph will normally size writes based on the file stripe
	size.

  rsize=X
	Specify the maximum readahead.

  mount_timeout=X
	Specify the timeout value for mount (in seconds), in the case
	of a non-responsive Ceph file system. The default is 30
	seconds.

  rbytes
	When stat() is called on a directory, set st_size to 'rbytes',
	the summation of file sizes over all files nested beneath that
	directory. This is the default.

  norbytes
	When stat() is called on a directory, set st_size to the
	number of entries in that directory.

  nocrc
	Disable CRC32C calculation for data writes. If set, the OSD
	must rely on TCP's error detection to catch data corruption
	in the data payload.

  noasyncreaddir
	Disable the client's use of its local cache to satisfy readdir
	requests. (This does not change correctness; the client uses
	cached metadata only when a lease or capability ensures it is
	valid.)


More Information
================

For more information on Ceph, see the home page at
	http://ceph.newdream.net/

The Linux kernel client source tree is available at
	git://ceph.newdream.net/linux-ceph-client.git

and the source for the full system is at
	git://ceph.newdream.net/ceph.git
@@ -291,6 +291,7 @@ Code	Seq#(hex)	Include File		Comments
0x92	00-0F	drivers/usb/mon/mon_bin.c
0x93	60-7F	linux/auto_fs.h
0x94	all	fs/btrfs/ioctl.h
0x97	00-7F	fs/ceph/ioctl.h		Ceph file system
0x99	00-0F				537-Addinboard driver
					<mailto:buk@buks.ipn.de>
0xA0	all	linux/sdp/sdp.h		Industrial Device Project
@@ -1441,6 +1441,15 @@ F:	arch/powerpc/include/asm/spu*.h
F:	arch/powerpc/oprofile/*cell*
F:	arch/powerpc/platforms/cell/

CEPH DISTRIBUTED FILE SYSTEM CLIENT
M:	Sage Weil <sage@newdream.net>
L:	ceph-devel@lists.sourceforge.net
W:	http://ceph.newdream.net/
T:	git git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git
S:	Supported
F:	Documentation/filesystems/ceph.txt
F:	fs/ceph

CERTIFIED WIRELESS USB (WUSB) SUBSYSTEM:
M:	David Vrabel <david.vrabel@csr.com>
L:	linux-usb@vger.kernel.org
@@ -235,6 +235,7 @@ config NFS_COMMON

source "net/sunrpc/Kconfig"
source "fs/smbfs/Kconfig"
source "fs/ceph/Kconfig"
source "fs/cifs/Kconfig"
source "fs/ncpfs/Kconfig"
source "fs/coda/Kconfig"
@@ -125,3 +125,4 @@ obj-$(CONFIG_OCFS2_FS)		+= ocfs2/
obj-$(CONFIG_BTRFS_FS)		+= btrfs/
obj-$(CONFIG_GFS2_FS)		+= gfs2/
obj-$(CONFIG_EXOFS_FS)		+= exofs/
obj-$(CONFIG_CEPH_FS)		+= ceph/
27
fs/ceph/Kconfig
Normal file
@@ -0,0 +1,27 @@
config CEPH_FS
	tristate "Ceph distributed file system (EXPERIMENTAL)"
	depends on INET && EXPERIMENTAL
	select LIBCRC32C
	select CRYPTO_AES
	help
	  Choose Y or M here to include support for mounting the
	  experimental Ceph distributed file system. Ceph is an extremely
	  scalable file system designed to provide high performance,
	  reliable access to petabytes of storage.

	  More information at http://ceph.newdream.net/.

	  If unsure, say N.

config CEPH_FS_PRETTYDEBUG
	bool "Include file:line in ceph debug output"
	depends on CEPH_FS
	default n
	help
	  If you say Y here, debug output will include a filename and
	  line to aid debugging. This increases kernel size and slows
	  execution slightly when debug call sites are enabled (e.g.,
	  via CONFIG_DYNAMIC_DEBUG).

	  If unsure, say N.

39
fs/ceph/Makefile
Normal file
@@ -0,0 +1,39 @@
#
# Makefile for CEPH filesystem.
#

ifneq ($(KERNELRELEASE),)

obj-$(CONFIG_CEPH_FS) += ceph.o

ceph-objs := super.o inode.o dir.o file.o addr.o ioctl.o \
	export.o caps.o snap.o xattr.o \
	messenger.o msgpool.o buffer.o pagelist.o \
	mds_client.o mdsmap.o \
	mon_client.o \
	osd_client.o osdmap.o crush/crush.o crush/mapper.o crush/hash.o \
	debugfs.o \
	auth.o auth_none.o \
	crypto.o armor.o \
	auth_x.o \
	ceph_fs.o ceph_strings.o ceph_hash.o ceph_frag.o

else
#Otherwise we were called directly from the command
# line; invoke the kernel build system.

KERNELDIR ?= /lib/modules/$(shell uname -r)/build
PWD := $(shell pwd)

default: all

all:
	$(MAKE) -C $(KERNELDIR) M=$(PWD) CONFIG_CEPH_FS=m modules

modules_install:
	$(MAKE) -C $(KERNELDIR) M=$(PWD) CONFIG_CEPH_FS=m modules_install

clean:
	$(MAKE) -C $(KERNELDIR) M=$(PWD) clean

endif
20
fs/ceph/README
Normal file
@@ -0,0 +1,20 @@
#
# The following files are shared by (and manually synchronized
# between) the Ceph userland and kernel client.
#
# userland                  kernel
src/include/ceph_fs.h		fs/ceph/ceph_fs.h
src/include/ceph_fs.cc		fs/ceph/ceph_fs.c
src/include/msgr.h		fs/ceph/msgr.h
src/include/rados.h		fs/ceph/rados.h
src/include/ceph_strings.cc	fs/ceph/ceph_strings.c
src/include/ceph_frag.h		fs/ceph/ceph_frag.h
src/include/ceph_frag.cc	fs/ceph/ceph_frag.c
src/include/ceph_hash.h		fs/ceph/ceph_hash.h
src/include/ceph_hash.cc	fs/ceph/ceph_hash.c
src/crush/crush.c		fs/ceph/crush/crush.c
src/crush/crush.h		fs/ceph/crush/crush.h
src/crush/mapper.c		fs/ceph/crush/mapper.c
src/crush/mapper.h		fs/ceph/crush/mapper.h
src/crush/hash.h		fs/ceph/crush/hash.h
src/crush/hash.c		fs/ceph/crush/hash.c
1188
fs/ceph/addr.c
Normal file
File diff suppressed because it is too large
99
fs/ceph/armor.c
Normal file
@@ -0,0 +1,99 @@

#include <linux/errno.h>

/*
 * base64 encode/decode.
 */

const char *pem_key = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

static int encode_bits(int c)
{
	return pem_key[c];
}

static int decode_bits(char c)
{
	if (c >= 'A' && c <= 'Z')
		return c - 'A';
	if (c >= 'a' && c <= 'z')
		return c - 'a' + 26;
	if (c >= '0' && c <= '9')
		return c - '0' + 52;
	if (c == '+')
		return 62;
	if (c == '/')
		return 63;
	if (c == '=')
		return 0; /* just non-negative, please */
	return -EINVAL;
}

int ceph_armor(char *dst, const char *src, const char *end)
{
	int olen = 0;
	int line = 0;

	while (src < end) {
		unsigned char a, b, c;

		a = *src++;
		*dst++ = encode_bits(a >> 2);
		if (src < end) {
			b = *src++;
			*dst++ = encode_bits(((a & 3) << 4) | (b >> 4));
			if (src < end) {
				c = *src++;
				*dst++ = encode_bits(((b & 15) << 2) |
						     (c >> 6));
				*dst++ = encode_bits(c & 63);
			} else {
				*dst++ = encode_bits((b & 15) << 2);
				*dst++ = '=';
			}
		} else {
			*dst++ = encode_bits(((a & 3) << 4));
			*dst++ = '=';
			*dst++ = '=';
		}
		olen += 4;
		line += 4;
		if (line == 64) {
			line = 0;
			*(dst++) = '\n';
			olen++;
		}
	}
	return olen;
}

int ceph_unarmor(char *dst, const char *src, const char *end)
{
	int olen = 0;

	while (src < end) {
		int a, b, c, d;

		if (src < end && src[0] == '\n')
			src++;
		if (src + 4 > end)
			return -EINVAL;
		a = decode_bits(src[0]);
		b = decode_bits(src[1]);
		c = decode_bits(src[2]);
		d = decode_bits(src[3]);
		if (a < 0 || b < 0 || c < 0 || d < 0)
			return -EINVAL;

		*dst++ = (a << 2) | (b >> 4);
		if (src[2] == '=')
			return olen + 1;
		*dst++ = ((b & 15) << 4) | (c >> 2);
		if (src[3] == '=')
			return olen + 2;
		*dst++ = ((c & 3) << 6) | d;
		olen += 3;
		src += 4;
	}
	return olen;
}
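The bit manipulation in ceph_armor() and ceph_unarmor() above is standard base64: three input bytes become four 6-bit symbols, with '=' padding for a short final group. The following userspace sketch mirrors that logic so it can be exercised outside the kernel. It is an illustration rather than the commit's code: the names b64, b64_val(), armor() and unarmor() are hypothetical, the functions take a length instead of an end pointer, and the 64-column newline wrapping is left out.

```c
#include <assert.h>
#include <string.h>

static const char b64[] =
	"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Map one base64 character back to its 6-bit value, or -1 if invalid. */
static int b64_val(char c)
{
	const char *p;

	if (c == '=')
		return 0;	/* padding decodes as zero bits */
	p = c ? strchr(b64, c) : NULL;
	return p ? (int)(p - b64) : -1;
}

/* Encode len bytes; dst needs 4 * ((len + 2) / 3) bytes. */
static int armor(char *dst, const unsigned char *src, size_t len)
{
	int olen = 0;
	size_t i;

	assert(dst && src);
	for (i = 0; i < len; i += 3) {
		unsigned a = src[i];
		unsigned b = i + 1 < len ? src[i + 1] : 0;
		unsigned c = i + 2 < len ? src[i + 2] : 0;

		dst[olen++] = b64[a >> 2];
		dst[olen++] = b64[((a & 3) << 4) | (b >> 4)];
		dst[olen++] = i + 1 < len ? b64[((b & 15) << 2) | (c >> 6)] : '=';
		dst[olen++] = i + 2 < len ? b64[c & 63] : '=';
	}
	return olen;
}

/* Decode len characters (a multiple of 4); returns bytes written or -1. */
static int unarmor(unsigned char *dst, const char *src, size_t len)
{
	int olen = 0;
	size_t i;

	if (len % 4)
		return -1;
	for (i = 0; i < len; i += 4) {
		int a = b64_val(src[i]), b = b64_val(src[i + 1]);
		int c = b64_val(src[i + 2]), d = b64_val(src[i + 3]);

		if (a < 0 || b < 0 || c < 0 || d < 0)
			return -1;
		dst[olen++] = (unsigned char)((a << 2) | (b >> 4));
		if (src[i + 2] == '=')
			break;		/* one byte in the final group */
		dst[olen++] = (unsigned char)(((b & 15) << 4) | (c >> 2));
		if (src[i + 3] == '=')
			break;		/* two bytes in the final group */
		dst[olen++] = (unsigned char)(((c & 3) << 6) | d);
	}
	return olen;
}
```

Encoding the two bytes "fo" gives "Zm8=", and decoding that string returns the same two bytes, which is the padding case handled by the 'src[3] == =' branch in ceph_unarmor().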
257
fs/ceph/auth.c
Normal file
@@ -0,0 +1,257 @@
#include "ceph_debug.h"

#include <linux/module.h>
#include <linux/err.h>

#include "types.h"
#include "auth_none.h"
#include "auth_x.h"
#include "decode.h"
#include "super.h"

#include "messenger.h"

/*
 * get protocol handler
 */
static u32 supported_protocols[] = {
	CEPH_AUTH_NONE,
	CEPH_AUTH_CEPHX
};

int ceph_auth_init_protocol(struct ceph_auth_client *ac, int protocol)
{
	switch (protocol) {
	case CEPH_AUTH_NONE:
		return ceph_auth_none_init(ac);
	case CEPH_AUTH_CEPHX:
		return ceph_x_init(ac);
	default:
		return -ENOENT;
	}
}

/*
 * setup, teardown.
 */
struct ceph_auth_client *ceph_auth_init(const char *name, const char *secret)
{
	struct ceph_auth_client *ac;
	int ret;

	dout("auth_init name '%s' secret '%s'\n", name, secret);

	ret = -ENOMEM;
	ac = kzalloc(sizeof(*ac), GFP_NOFS);
	if (!ac)
		goto out;

	ac->negotiating = true;
	if (name)
		ac->name = name;
	else
		ac->name = CEPH_AUTH_NAME_DEFAULT;
	dout("auth_init name %s secret %s\n", ac->name, secret);
	ac->secret = secret;
	return ac;

out:
	return ERR_PTR(ret);
}

void ceph_auth_destroy(struct ceph_auth_client *ac)
{
	dout("auth_destroy %p\n", ac);
	if (ac->ops)
		ac->ops->destroy(ac);
	kfree(ac);
}

/*
 * Reset occurs when reconnecting to the monitor.
 */
void ceph_auth_reset(struct ceph_auth_client *ac)
{
	dout("auth_reset %p\n", ac);
	if (ac->ops && !ac->negotiating)
		ac->ops->reset(ac);
	ac->negotiating = true;
}

int ceph_entity_name_encode(const char *name, void **p, void *end)
{
	int len = strlen(name);

	if (*p + 2*sizeof(u32) + len > end)
		return -ERANGE;
	ceph_encode_32(p, CEPH_ENTITY_TYPE_CLIENT);
	ceph_encode_32(p, len);
	ceph_encode_copy(p, name, len);
	return 0;
}

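To make the byte layout produced by ceph_entity_name_encode() above concrete, here is a userspace sketch of the same three steps: a 32-bit entity type, a 32-bit length, then the raw name bytes, with a bounds check before anything is written. It assumes the 32-bit fields are stored little-endian, which is how the Ceph wire encoding helpers are understood to behave; the helper names put_le32() and entity_name_encode() are hypothetical, and the entity type is passed in by the caller rather than hard-coded.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store a 32-bit value little-endian and advance the cursor. */
static void put_le32(unsigned char **p, uint32_t v)
{
	(*p)[0] = (unsigned char)(v & 0xff);
	(*p)[1] = (unsigned char)((v >> 8) & 0xff);
	(*p)[2] = (unsigned char)((v >> 16) & 0xff);
	(*p)[3] = (unsigned char)((v >> 24) & 0xff);
	*p += 4;
}

/*
 * Encode <u32 type><u32 len><name bytes> at *p, never writing past end.
 * Returns 0 on success or -1 if the buffer is too small; on failure
 * the cursor is left untouched.
 */
static int entity_name_encode(uint32_t type, const char *name,
			      unsigned char **p, const unsigned char *end)
{
	size_t len = strlen(name);

	assert(*p <= end);
	if ((size_t)(end - *p) < 2 * sizeof(uint32_t) + len)
		return -1;
	put_le32(p, type);
	put_le32(p, (uint32_t)len);
	memcpy(*p, name, len);
	*p += len;
	return 0;
}
```

Comparing the remaining space against the needed size, rather than adding to the cursor first, keeps the check free of pointer arithmetic beyond the buffer.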
/*
 * Initiate protocol negotiation with monitor.  Include entity name
 * and list supported protocols.
 */
int ceph_auth_build_hello(struct ceph_auth_client *ac, void *buf, size_t len)
{
	struct ceph_mon_request_header *monhdr = buf;
	void *p = monhdr + 1, *end = buf + len, *lenp;
	int i, num;
	int ret;

	dout("auth_build_hello\n");
	monhdr->have_version = 0;
	monhdr->session_mon = cpu_to_le16(-1);
	monhdr->session_mon_tid = 0;

	ceph_encode_32(&p, 0);  /* no protocol, yet */

	lenp = p;
	p += sizeof(u32);

	ceph_decode_need(&p, end, 1 + sizeof(u32), bad);
	ceph_encode_8(&p, 1);
	num = ARRAY_SIZE(supported_protocols);
	ceph_encode_32(&p, num);
	ceph_decode_need(&p, end, num * sizeof(u32), bad);
	for (i = 0; i < num; i++)
		ceph_encode_32(&p, supported_protocols[i]);

	ret = ceph_entity_name_encode(ac->name, &p, end);
	if (ret < 0)
		return ret;
	ceph_decode_need(&p, end, sizeof(u64), bad);
	ceph_encode_64(&p, ac->global_id);

	ceph_encode_32(&lenp, p - lenp - sizeof(u32));
	return p - buf;

bad:
	return -ERANGE;
}

int ceph_build_auth_request(struct ceph_auth_client *ac,
			    void *msg_buf, size_t msg_len)
{
	struct ceph_mon_request_header *monhdr = msg_buf;
	void *p = monhdr + 1;
	void *end = msg_buf + msg_len;
	int ret;

	monhdr->have_version = 0;
	monhdr->session_mon = cpu_to_le16(-1);
	monhdr->session_mon_tid = 0;

	ceph_encode_32(&p, ac->protocol);

	ret = ac->ops->build_request(ac, p + sizeof(u32), end);
	if (ret < 0) {
		pr_err("error %d building request\n", ret);
		return ret;
	}
	dout(" built request %d bytes\n", ret);
	ceph_encode_32(&p, ret);
	return p + ret - msg_buf;
}

/*
 * Handle auth message from monitor.
 */
int ceph_handle_auth_reply(struct ceph_auth_client *ac,
			   void *buf, size_t len,
			   void *reply_buf, size_t reply_len)
{
	void *p = buf;
	void *end = buf + len;
	int protocol;
	s32 result;
	u64 global_id;
	void *payload, *payload_end;
	int payload_len;
	char *result_msg;
	int result_msg_len;
	int ret = -EINVAL;

	dout("handle_auth_reply %p %p\n", p, end);
	ceph_decode_need(&p, end, sizeof(u32) * 3 + sizeof(u64), bad);
	protocol = ceph_decode_32(&p);
	result = ceph_decode_32(&p);
	global_id = ceph_decode_64(&p);
	payload_len = ceph_decode_32(&p);
	payload = p;
	p += payload_len;
	ceph_decode_need(&p, end, sizeof(u32), bad);
	result_msg_len = ceph_decode_32(&p);
	result_msg = p;
	p += result_msg_len;
	if (p != end)
		goto bad;

	dout(" result %d '%.*s' gid %llu len %d\n", result, result_msg_len,
	     result_msg, global_id, payload_len);

	payload_end = payload + payload_len;

	if (global_id && ac->global_id != global_id) {
		dout(" set global_id %lld -> %lld\n", ac->global_id, global_id);
		ac->global_id = global_id;
	}

	if (ac->negotiating) {
		/* server does not support our protocols? */
		if (!protocol && result < 0) {
			ret = result;
			goto out;
		}
		/* set up (new) protocol handler? */
		if (ac->protocol && ac->protocol != protocol) {
			ac->ops->destroy(ac);
			ac->protocol = 0;
			ac->ops = NULL;
		}
		if (ac->protocol != protocol) {
			ret = ceph_auth_init_protocol(ac, protocol);
			if (ret) {
				pr_err("error %d on auth protocol %d init\n",
				       ret, protocol);
				goto out;
			}
		}

		ac->negotiating = false;
	}

	ret = ac->ops->handle_reply(ac, result, payload, payload_end);
	if (ret == -EAGAIN) {
		return ceph_build_auth_request(ac, reply_buf, reply_len);
	} else if (ret) {
		pr_err("authentication error %d\n", ret);
		return ret;
	}
	return 0;

bad:
	pr_err("failed to decode auth msg\n");
out:
	return ret;
}

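The first half of ceph_handle_auth_reply() above decodes a fixed header (protocol, result, global_id) followed by two length-prefixed blobs (the payload and a result message) and rejects messages with trailing bytes. The sketch below shows that decoding step in userspace with a bounds check before every read. It is an assumption-laden illustration, not the kernel's code: integers are taken to be little-endian, the struct auth_reply container and the get_le32(), get_blob() and decode_auth_reply() helpers are hypothetical, and the sketch checks each blob length before advancing the cursor, which is stricter than the order used in the function above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct auth_reply {		/* hypothetical container for decoded fields */
	uint32_t protocol;
	int32_t result;
	uint64_t global_id;
	const unsigned char *payload;
	uint32_t payload_len;
	const unsigned char *msg;
	uint32_t msg_len;
};

/* Read a little-endian u32 if at least 4 bytes remain. */
static int get_le32(const unsigned char **p, const unsigned char *end,
		    uint32_t *v)
{
	if (end - *p < 4)
		return -1;
	*v = (uint32_t)(*p)[0] | (uint32_t)(*p)[1] << 8 |
	     (uint32_t)(*p)[2] << 16 | (uint32_t)(*p)[3] << 24;
	*p += 4;
	return 0;
}

/* Read a <u32 len><bytes> blob, verifying len against the space left. */
static int get_blob(const unsigned char **p, const unsigned char *end,
		    const unsigned char **blob, uint32_t *len)
{
	if (get_le32(p, end, len) || (size_t)(end - *p) < *len)
		return -1;
	*blob = *p;
	*p += *len;
	return 0;
}

static int decode_auth_reply(const unsigned char *p, const unsigned char *end,
			     struct auth_reply *r)
{
	uint32_t res, lo, hi;

	assert(p <= end);
	if (get_le32(&p, end, &r->protocol) || get_le32(&p, end, &res) ||
	    get_le32(&p, end, &lo) || get_le32(&p, end, &hi))
		return -1;
	/* portable two's-complement reinterpretation of the result code */
	r->result = res <= INT32_MAX ? (int32_t)res
				     : -(int32_t)(UINT32_MAX - res) - 1;
	r->global_id = (uint64_t)hi << 32 | lo;
	if (get_blob(&p, end, &r->payload, &r->payload_len) ||
	    get_blob(&p, end, &r->msg, &r->msg_len))
		return -1;
	return p == end ? 0 : -1;	/* trailing bytes are an error */
}
```

Once the header and both blobs are decoded, the payload range is what would be handed to the protocol-specific handle_reply callback.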
int ceph_build_auth(struct ceph_auth_client *ac,
		    void *msg_buf, size_t msg_len)
{
	if (!ac->protocol)
		return ceph_auth_build_hello(ac, msg_buf, msg_len);
	BUG_ON(!ac->ops);
	if (!ac->ops->is_authenticated(ac))
		return ceph_build_auth_request(ac, msg_buf, msg_len);
	return 0;
}

int ceph_auth_is_authenticated(struct ceph_auth_client *ac)
{
	if (!ac->ops)
		return 0;
	return ac->ops->is_authenticated(ac);
}
84
fs/ceph/auth.h
Normal file
@@ -0,0 +1,84 @@
#ifndef _FS_CEPH_AUTH_H
#define _FS_CEPH_AUTH_H

#include "types.h"
#include "buffer.h"

/*
 * Abstract interface for communicating with the authentication module.
 * There is some handshake that takes place between us and the monitor
 * to acquire the necessary keys.  These are used to generate an
 * 'authorizer' that we use when connecting to a service (mds, osd).
 */

struct ceph_auth_client;
struct ceph_authorizer;

struct ceph_auth_client_ops {
	/*
	 * true if we are authenticated and can connect to
	 * services.
	 */
	int (*is_authenticated)(struct ceph_auth_client *ac);

	/*
	 * build requests and process replies during monitor
	 * handshake.  if handle_reply returns -EAGAIN, we build
	 * another request.
	 */
	int (*build_request)(struct ceph_auth_client *ac, void *buf, void *end);
	int (*handle_reply)(struct ceph_auth_client *ac, int result,
			    void *buf, void *end);

	/*
	 * Create authorizer for connecting to a service, and verify
	 * the response to authenticate the service.
	 */
	int (*create_authorizer)(struct ceph_auth_client *ac, int peer_type,
				 struct ceph_authorizer **a,
				 void **buf, size_t *len,
				 void **reply_buf, size_t *reply_len);
	int (*verify_authorizer_reply)(struct ceph_auth_client *ac,
				       struct ceph_authorizer *a, size_t len);
	void (*destroy_authorizer)(struct ceph_auth_client *ac,
				   struct ceph_authorizer *a);
	void (*invalidate_authorizer)(struct ceph_auth_client *ac,
				      int peer_type);

	/* reset when we (re)connect to a monitor */
	void (*reset)(struct ceph_auth_client *ac);

	void (*destroy)(struct ceph_auth_client *ac);
};

struct ceph_auth_client {
	u32 protocol;           /* CEPH_AUTH_* */
	void *private;          /* for use by protocol implementation */
	const struct ceph_auth_client_ops *ops;  /* null iff protocol==0 */

	bool negotiating;       /* true if negotiating protocol */
	const char *name;       /* entity name */
	u64 global_id;          /* our unique id in system */
	const char *secret;     /* our secret key */
	unsigned want_keys;     /* which services we want */
};

extern struct ceph_auth_client *ceph_auth_init(const char *name,
					       const char *secret);
extern void ceph_auth_destroy(struct ceph_auth_client *ac);

extern void ceph_auth_reset(struct ceph_auth_client *ac);

extern int ceph_auth_build_hello(struct ceph_auth_client *ac,
				 void *buf, size_t len);
extern int ceph_handle_auth_reply(struct ceph_auth_client *ac,
				  void *buf, size_t len,
				  void *reply_buf, size_t reply_len);
extern int ceph_entity_name_encode(const char *name, void **p, void *end);

extern int ceph_build_auth(struct ceph_auth_client *ac,
			   void *msg_buf, size_t msg_len);

extern int ceph_auth_is_authenticated(struct ceph_auth_client *ac);

#endif
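The header above describes a table of function pointers (ceph_auth_client_ops) that is NULL until a protocol has been negotiated, after which generic code dispatches through it. The following cut-down userspace sketch shows that pattern in isolation. Everything in it is a stand-in chosen for illustration: struct auth_ops and struct auth_client carry only two callbacks and one state flag, and the AUTH_NONE id, the none_* handlers and the auth_* functions are hypothetical names, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

struct auth_client;

struct auth_ops {			/* cut-down stand-in for the ops table */
	int (*is_authenticated)(struct auth_client *ac);
	void (*destroy)(struct auth_client *ac);
};

struct auth_client {
	unsigned protocol;		/* 0 until a protocol is chosen */
	const struct auth_ops *ops;	/* NULL iff protocol == 0 */
	int done;			/* hypothetical per-protocol state */
};

static int none_is_authenticated(struct auth_client *ac) { return ac->done; }
static void none_destroy(struct auth_client *ac) { ac->done = 0; }

static const struct auth_ops none_ops = {
	.is_authenticated = none_is_authenticated,
	.destroy = none_destroy,
};

enum { AUTH_NONE = 1 };			/* hypothetical protocol id */

/* Install the handler table for the negotiated protocol. */
static int auth_init_protocol(struct auth_client *ac, unsigned protocol)
{
	switch (protocol) {
	case AUTH_NONE:
		ac->protocol = protocol;
		ac->ops = &none_ops;
		return 0;
	default:
		return -1;		/* unknown protocol: leave ops unset */
	}
}

/* Generic entry point: not authenticated until a handler is installed. */
static int auth_is_authenticated(struct auth_client *ac)
{
	assert(ac != NULL);
	return ac->ops ? ac->ops->is_authenticated(ac) : 0;
}
```

Keeping the NULL check in the generic wrapper means callers never need to know whether negotiation has finished, which matches the "null iff protocol==0" note on the ops field.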
121
fs/ceph/auth_none.c
Normal file
@@ -0,0 +1,121 @@

#include "ceph_debug.h"

#include <linux/err.h>
#include <linux/module.h>
#include <linux/random.h>

#include "auth_none.h"
#include "auth.h"
#include "decode.h"

static void reset(struct ceph_auth_client *ac)
{
	struct ceph_auth_none_info *xi = ac->private;

	xi->starting = true;
	xi->built_authorizer = false;
}

static void destroy(struct ceph_auth_client *ac)
{
	kfree(ac->private);
	ac->private = NULL;
}

static int is_authenticated(struct ceph_auth_client *ac)
{
	struct ceph_auth_none_info *xi = ac->private;

	return !xi->starting;
}

/*
 * the generic auth code decodes the global_id, and we carry no actual
 * authentication state, so nothing happens here.
 */
static int handle_reply(struct ceph_auth_client *ac, int result,
			void *buf, void *end)
{
	struct ceph_auth_none_info *xi = ac->private;

	xi->starting = false;
	return result;
}

/*
 * build an 'authorizer' with our entity_name and global_id.  we can
 * reuse a single static copy since it is identical for all services
 * we connect to.
 */
static int ceph_auth_none_create_authorizer(
	struct ceph_auth_client *ac, int peer_type,
	struct ceph_authorizer **a,
	void **buf, size_t *len,
	void **reply_buf, size_t *reply_len)
{
	struct ceph_auth_none_info *ai = ac->private;
	struct ceph_none_authorizer *au = &ai->au;
	void *p, *end;
	int ret;

	if (!ai->built_authorizer) {
		p = au->buf;
		end = p + sizeof(au->buf);
		ceph_encode_8(&p, 1);
		ret = ceph_entity_name_encode(ac->name, &p, end - 8);
		if (ret < 0)
			goto bad;
		ceph_decode_need(&p, end, sizeof(u64), bad2);
		ceph_encode_64(&p, ac->global_id);
		au->buf_len = p - (void *)au->buf;
		ai->built_authorizer = true;
		dout("built authorizer len %d\n", au->buf_len);
	}

	*a = (struct ceph_authorizer *)au;
	*buf = au->buf;
	*len = au->buf_len;
	*reply_buf = au->reply_buf;
	*reply_len = sizeof(au->reply_buf);
	return 0;

bad2:
	ret = -ERANGE;
bad:
	return ret;
}

static void ceph_auth_none_destroy_authorizer(struct ceph_auth_client *ac,
					       struct ceph_authorizer *a)
{
	/* nothing to do */
}

static const struct ceph_auth_client_ops ceph_auth_none_ops = {
	.reset = reset,
	.destroy = destroy,
	.is_authenticated = is_authenticated,
	.handle_reply = handle_reply,
	.create_authorizer = ceph_auth_none_create_authorizer,
	.destroy_authorizer = ceph_auth_none_destroy_authorizer,
};

int ceph_auth_none_init(struct ceph_auth_client *ac)
{
	struct ceph_auth_none_info *xi;

	dout("ceph_auth_none_init %p\n", ac);
	xi = kzalloc(sizeof(*xi), GFP_NOFS);
	if (!xi)
		return -ENOMEM;

	xi->starting = true;
	xi->built_authorizer = false;

	ac->protocol = CEPH_AUTH_NONE;
	ac->private = xi;
	ac->ops = &ceph_auth_none_ops;
	return 0;
}

28
fs/ceph/auth_none.h
Normal file
@@ -0,0 +1,28 @@
#ifndef _FS_CEPH_AUTH_NONE_H
#define _FS_CEPH_AUTH_NONE_H

#include "auth.h"

/*
 * null security mode.
 *
 * we use a single static authorizer that simply encodes our entity name
 * and global id.
 */

struct ceph_none_authorizer {
	char buf[128];
	int buf_len;
	char reply_buf[0];
};

struct ceph_auth_none_info {
	bool starting;
	bool built_authorizer;
	struct ceph_none_authorizer au;   /* we only need one; it's static */
};

extern int ceph_auth_none_init(struct ceph_auth_client *ac);

#endif

656
fs/ceph/auth_x.c
Normal file
@@ -0,0 +1,656 @@

#include "ceph_debug.h"

#include <linux/err.h>
#include <linux/module.h>
#include <linux/random.h>

#include "auth_x.h"
#include "auth_x_protocol.h"
#include "crypto.h"
#include "auth.h"
#include "decode.h"

struct kmem_cache *ceph_x_ticketbuf_cachep;

#define TEMP_TICKET_BUF_LEN	256

static void ceph_x_validate_tickets(struct ceph_auth_client *ac, int *pneed);

static int ceph_x_is_authenticated(struct ceph_auth_client *ac)
{
	struct ceph_x_info *xi = ac->private;
	int need;

	ceph_x_validate_tickets(ac, &need);
	dout("ceph_x_is_authenticated want=%d need=%d have=%d\n",
	     ac->want_keys, need, xi->have_keys);
	return (ac->want_keys & xi->have_keys) == ac->want_keys;
}

static int ceph_x_encrypt(struct ceph_crypto_key *secret,
			  void *ibuf, int ilen, void *obuf, size_t olen)
{
	struct ceph_x_encrypt_header head = {
		.struct_v = 1,
		.magic = cpu_to_le64(CEPHX_ENC_MAGIC)
	};
	size_t len = olen - sizeof(u32);
	int ret;

	ret = ceph_encrypt2(secret, obuf + sizeof(u32), &len,
			    &head, sizeof(head), ibuf, ilen);
	if (ret)
		return ret;
	ceph_encode_32(&obuf, len);
	return len + sizeof(u32);
}

static int ceph_x_decrypt(struct ceph_crypto_key *secret,
			  void **p, void *end, void *obuf, size_t olen)
{
	struct ceph_x_encrypt_header head;
	size_t head_len = sizeof(head);
	int len, ret;

	len = ceph_decode_32(p);
	if (*p + len > end)
		return -EINVAL;

	dout("ceph_x_decrypt len %d\n", len);
	ret = ceph_decrypt2(secret, &head, &head_len, obuf, &olen,
			    *p, len);
	if (ret)
		return ret;
	if (head.struct_v != 1 || le64_to_cpu(head.magic) != CEPHX_ENC_MAGIC)
		return -EPERM;
	*p += len;
	return olen;
}

/*
 * get existing (or insert new) ticket handler
 */
struct ceph_x_ticket_handler *get_ticket_handler(struct ceph_auth_client *ac,
						 int service)
{
	struct ceph_x_ticket_handler *th;
	struct ceph_x_info *xi = ac->private;
	struct rb_node *parent = NULL, **p = &xi->ticket_handlers.rb_node;

	while (*p) {
		parent = *p;
		th = rb_entry(parent, struct ceph_x_ticket_handler, node);
		if (service < th->service)
			p = &(*p)->rb_left;
		else if (service > th->service)
			p = &(*p)->rb_right;
		else
			return th;
	}

	/* add it */
	th = kzalloc(sizeof(*th), GFP_NOFS);
	if (!th)
		return ERR_PTR(-ENOMEM);
	th->service = service;
	rb_link_node(&th->node, parent, p);
	rb_insert_color(&th->node, &xi->ticket_handlers);
	return th;
}

static void remove_ticket_handler(struct ceph_auth_client *ac,
				  struct ceph_x_ticket_handler *th)
{
	struct ceph_x_info *xi = ac->private;

	dout("remove_ticket_handler %p %d\n", th, th->service);
	rb_erase(&th->node, &xi->ticket_handlers);
	ceph_crypto_key_destroy(&th->session_key);
	if (th->ticket_blob)
		ceph_buffer_put(th->ticket_blob);
	kfree(th);
}

|
||||
static int ceph_x_proc_ticket_reply(struct ceph_auth_client *ac,
|
||||
struct ceph_crypto_key *secret,
|
||||
void *buf, void *end)
|
||||
{
|
||||
struct ceph_x_info *xi = ac->private;
|
||||
int num;
|
||||
void *p = buf;
|
||||
int ret;
|
||||
char *dbuf;
|
||||
char *ticket_buf;
|
||||
u8 struct_v;
|
||||
|
||||
dbuf = kmem_cache_alloc(ceph_x_ticketbuf_cachep, GFP_NOFS | GFP_ATOMIC);
|
||||
if (!dbuf)
|
||||
return -ENOMEM;
|
||||
|
||||
ret = -ENOMEM;
|
||||
ticket_buf = kmem_cache_alloc(ceph_x_ticketbuf_cachep,
|
||||
GFP_NOFS | GFP_ATOMIC);
|
||||
if (!ticket_buf)
|
||||
goto out_dbuf;
|
||||
|
||||
ceph_decode_need(&p, end, 1 + sizeof(u32), bad);
|
||||
struct_v = ceph_decode_8(&p);
|
||||
if (struct_v != 1)
|
||||
goto bad;
|
||||
num = ceph_decode_32(&p);
|
||||
dout("%d tickets\n", num);
|
||||
while (num--) {
|
||||
int type;
|
||||
u8 struct_v;
|
||||
struct ceph_x_ticket_handler *th;
|
||||
void *dp, *dend;
|
||||
int dlen;
|
||||
char is_enc;
|
||||
struct timespec validity;
|
||||
struct ceph_crypto_key old_key;
|
||||
void *tp, *tpend;
|
||||
|
||||
ceph_decode_need(&p, end, sizeof(u32) + 1, bad);
|
||||
|
||||
type = ceph_decode_32(&p);
|
||||
dout(" ticket type %d %s\n", type, ceph_entity_type_name(type));
|
||||
|
||||
struct_v = ceph_decode_8(&p);
|
||||
if (struct_v != 1)
|
||||
goto bad;
|
||||
|
||||
th = get_ticket_handler(ac, type);
|
||||
if (IS_ERR(th)) {
|
||||
ret = PTR_ERR(th);
|
||||
goto out;
|
||||
}
|
||||
|
||||
/* blob for me */
|
||||
dlen = ceph_x_decrypt(secret, &p, end, dbuf,
|
||||
TEMP_TICKET_BUF_LEN);
|
||||
if (dlen <= 0) {
|
||||
ret = dlen;
|
||||
goto out;
|
||||
}
|
||||
dout(" decrypted %d bytes\n", dlen);
|
||||
dend = dbuf + dlen;
|
||||
dp = dbuf;
|
||||
|
||||
struct_v = ceph_decode_8(&dp);
|
||||
if (struct_v != 1)
|
||||
goto bad;
|
||||
|
||||
memcpy(&old_key, &th->session_key, sizeof(old_key));
|
||||
ret = ceph_crypto_key_decode(&th->session_key, &dp, dend);
|
||||
if (ret)
|
||||
goto out;
|
||||
|
||||
ceph_decode_copy(&dp, &th->validity, sizeof(th->validity));
|
||||
ceph_decode_timespec(&validity, &th->validity);
|
||||
th->expires = get_seconds() + validity.tv_sec;
|
||||
th->renew_after = th->expires - (validity.tv_sec / 4);
|
||||
dout(" expires=%lu renew_after=%lu\n", th->expires,
|
||||
th->renew_after);
|
||||
|
||||
/* ticket blob for service */
|
||||
ceph_decode_8_safe(&p, end, is_enc, bad);
|
||||
tp = ticket_buf;
|
||||
if (is_enc) {
|
||||
/* encrypted */
|
||||
dout(" encrypted ticket\n");
|
||||
dlen = ceph_x_decrypt(&old_key, &p, end, ticket_buf,
|
||||
TEMP_TICKET_BUF_LEN);
|
||||
if (dlen < 0) {
|
||||
ret = dlen;
|
||||
goto out;
|
||||
}
|
||||
dlen = ceph_decode_32(&tp);
|
||||
} else {
|
||||
/* unencrypted */
|
||||
ceph_decode_32_safe(&p, end, dlen, bad);
|
||||
ceph_decode_need(&p, end, dlen, bad);
|
||||
ceph_decode_copy(&p, ticket_buf, dlen);
|
||||
}
|
||||
tpend = tp + dlen;
|
||||
dout(" ticket blob is %d bytes\n", dlen);
|
||||
ceph_decode_need(&tp, tpend, 1 + sizeof(u64), bad);
|
||||
struct_v = ceph_decode_8(&tp);
|
||||
th->secret_id = ceph_decode_64(&tp);
|
||||
ret = ceph_decode_buffer(&th->ticket_blob, &tp, tpend);
|
||||
if (ret)
|
||||
goto out;
|
||||
dout(" got ticket service %d (%s) secret_id %lld len %d\n",
|
||||
type, ceph_entity_type_name(type), th->secret_id,
|
||||
(int)th->ticket_blob->vec.iov_len);
|
||||
xi->have_keys |= th->service;
|
||||
}
|
||||
|
||||
ret = 0;
|
||||
out:
|
||||
kmem_cache_free(ceph_x_ticketbuf_cachep, ticket_buf);
|
||||
out_dbuf:
|
||||
kmem_cache_free(ceph_x_ticketbuf_cachep, dbuf);
|
||||
return ret;
|
||||
|
||||
bad:
|
||||
ret = -EINVAL;
|
||||
goto out;
|
||||
}
|
||||
|
||||
static int ceph_x_build_authorizer(struct ceph_auth_client *ac,
|
||||
struct ceph_x_ticket_handler *th,
|
||||
struct ceph_x_authorizer *au)
|
||||
{
|
||||
int len;
|
||||
struct ceph_x_authorize_a *msg_a;
|
||||
struct ceph_x_authorize_b msg_b;
|
||||
void *p, *end;
|
||||
int ret;
|
||||
int ticket_blob_len =
|
||||
(th->ticket_blob ? th->ticket_blob->vec.iov_len : 0);
|
||||
|
||||
dout("build_authorizer for %s %p\n",
|
||||
ceph_entity_type_name(th->service), au);
|
||||
|
||||
len = sizeof(*msg_a) + sizeof(msg_b) + sizeof(u32) +
|
||||
ticket_blob_len + 16;
|
||||
dout(" need len %d\n", len);
|
||||
if (au->buf && au->buf->alloc_len < len) {
|
||||
ceph_buffer_put(au->buf);
|
||||
au->buf = NULL;
|
||||
}
|
||||
if (!au->buf) {
|
||||
au->buf = ceph_buffer_new(len, GFP_NOFS);
|
||||
if (!au->buf)
|
||||
return -ENOMEM;
|
||||
}
|
||||
au->service = th->service;
|
||||
|
||||
msg_a = au->buf->vec.iov_base;
|
||||
msg_a->struct_v = 1;
|
||||
msg_a->global_id = cpu_to_le64(ac->global_id);
|
||||
msg_a->service_id = cpu_to_le32(th->service);
|
||||
msg_a->ticket_blob.struct_v = 1;
|
||||
msg_a->ticket_blob.secret_id = cpu_to_le64(th->secret_id);
|
||||
msg_a->ticket_blob.blob_len = cpu_to_le32(ticket_blob_len);
|
||||
if (ticket_blob_len) {
|
||||
memcpy(msg_a->ticket_blob.blob, th->ticket_blob->vec.iov_base,
|
||||
th->ticket_blob->vec.iov_len);
|
||||
}
|
||||
dout(" th %p secret_id %lld %lld\n", th, th->secret_id,
|
||||
le64_to_cpu(msg_a->ticket_blob.secret_id));
|
||||
|
||||
p = msg_a + 1;
|
||||
p += ticket_blob_len;
|
||||
end = au->buf->vec.iov_base + au->buf->vec.iov_len;
|
||||
|
||||
get_random_bytes(&au->nonce, sizeof(au->nonce));
|
||||
msg_b.struct_v = 1;
|
||||
msg_b.nonce = cpu_to_le64(au->nonce);
|
||||
ret = ceph_x_encrypt(&th->session_key, &msg_b, sizeof(msg_b),
|
||||
p, end - p);
|
||||
if (ret < 0)
|
||||
goto out_buf;
|
||||
p += ret;
|
||||
au->buf->vec.iov_len = p - au->buf->vec.iov_base;
|
||||
dout(" built authorizer nonce %llx len %d\n", au->nonce,
|
||||
(int)au->buf->vec.iov_len);
|
||||
return 0;
|
||||
|
||||
out_buf:
|
||||
ceph_buffer_put(au->buf);
|
||||
au->buf = NULL;
|
||||
return ret;
|
||||
}
|
||||
|
||||
static int ceph_x_encode_ticket(struct ceph_x_ticket_handler *th,
|
||||
void **p, void *end)
|
||||
{
|
||||
ceph_decode_need(p, end, 1 + sizeof(u64), bad);
|
||||
ceph_encode_8(p, 1);
|
||||
ceph_encode_64(p, th->secret_id);
|
||||
if (th->ticket_blob) {
|
||||
const char *buf = th->ticket_blob->vec.iov_base;
|
||||
u32 len = th->ticket_blob->vec.iov_len;
|
||||
|
||||
ceph_encode_32_safe(p, end, len, bad);
|
||||
ceph_encode_copy_safe(p, end, buf, len, bad);
|
||||
} else {
|
||||
ceph_encode_32_safe(p, end, 0, bad);
|
||||
}
|
||||
|
||||
return 0;
|
||||
bad:
|
||||
return -ERANGE;
|
||||
}
|
||||
|
||||
static void ceph_x_validate_tickets(struct ceph_auth_client *ac, int *pneed)
|
||||
{
|
||||
int want = ac->want_keys;
|
||||
struct ceph_x_info *xi = ac->private;
|
||||
int service;
|
||||
|
||||
*pneed = ac->want_keys & ~(xi->have_keys);
|
||||
|
||||
for (service = 1; service <= want; service <<= 1) {
|
||||
struct ceph_x_ticket_handler *th;
|
||||
|
||||
if (!(ac->want_keys & service))
|
||||
continue;
|
||||
|
||||
if (*pneed & service)
|
||||
continue;
|
||||
|
||||
th = get_ticket_handler(ac, service);
|
||||
|
||||
if (IS_ERR(th)) {
|
||||
*pneed |= service;
|
||||
continue;
|
||||
}
|
||||
|
||||
if (get_seconds() >= th->renew_after)
|
||||
*pneed |= service;
|
||||
if (get_seconds() >= th->expires)
|
||||
xi->have_keys &= ~service;
|
||||
}
|
||||
}
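The renewal bookkeeping in ceph_x_proc_ticket_reply and ceph_x_validate_tickets can be illustrated in isolation. This is a simplified userspace sketch, assuming a flat array of tickets instead of the rbtree; `struct ticket`, `ticket_set_validity` and `tickets_needed` are hypothetical names introduced for the example.

```c
#include <assert.h>

struct ticket {			/* hypothetical, trimmed-down ticket handler */
	unsigned service;	/* a single service bit */
	unsigned long renew_after, expires;
};

/* Same arithmetic as the reply path: renew once 3/4 of the validity has passed. */
static void ticket_set_validity(struct ticket *t, unsigned long now,
				unsigned long validity)
{
	t->expires = now + validity;
	t->renew_after = t->expires - validity / 4;
}

/*
 * Return the service bits that should be (re)requested at time now.
 * Bits whose ticket has fully expired are also cleared from *have.
 */
static unsigned tickets_needed(unsigned want, unsigned *have,
			       const struct ticket *t, int n,
			       unsigned long now)
{
	unsigned need = want & ~*have;
	int i;

	assert(n >= 0);
	for (i = 0; i < n; i++) {
		if (!(want & t[i].service) || (need & t[i].service))
			continue;	/* not wanted, or already requested */
		if (now >= t[i].renew_after)
			need |= t[i].service;
		if (now >= t[i].expires)
			*have &= ~t[i].service;
	}
	return need;
}
```

A ticket past `renew_after` but not yet past `expires` is requested again while remaining usable, which is what lets renewal overlap with normal operation.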
|
||||
|
||||
|
||||
static int ceph_x_build_request(struct ceph_auth_client *ac,
|
||||
void *buf, void *end)
|
||||
{
|
||||
struct ceph_x_info *xi = ac->private;
|
||||
int need;
|
||||
struct ceph_x_request_header *head = buf;
|
||||
int ret;
|
||||
struct ceph_x_ticket_handler *th =
|
||||
get_ticket_handler(ac, CEPH_ENTITY_TYPE_AUTH);
|
||||
|
||||
ceph_x_validate_tickets(ac, &need);
|
||||
|
||||
dout("build_request want %x have %x need %x\n",
|
||||
ac->want_keys, xi->have_keys, need);
|
||||
|
||||
if (need & CEPH_ENTITY_TYPE_AUTH) {
|
||||
struct ceph_x_authenticate *auth = (void *)(head + 1);
|
||||
void *p = auth + 1;
|
||||
struct ceph_x_challenge_blob tmp;
|
||||
char tmp_enc[40];
|
||||
u64 *u;
|
||||
|
||||
if (p > end)
|
||||
return -ERANGE;
|
||||
|
||||
dout(" get_auth_session_key\n");
|
||||
head->op = cpu_to_le16(CEPHX_GET_AUTH_SESSION_KEY);
|
||||
|
||||
/* encrypt and hash */
|
||||
get_random_bytes(&auth->client_challenge, sizeof(u64));
|
||||
tmp.client_challenge = auth->client_challenge;
|
||||
tmp.server_challenge = cpu_to_le64(xi->server_challenge);
|
||||
ret = ceph_x_encrypt(&xi->secret, &tmp, sizeof(tmp),
|
||||
tmp_enc, sizeof(tmp_enc));
|
||||
if (ret < 0)
|
||||
return ret;
|
||||
|
||||
auth->struct_v = 1;
|
||||
auth->key = 0;
|
||||
for (u = (u64 *)tmp_enc; u + 1 <= (u64 *)(tmp_enc + ret); u++)
|
||||
auth->key ^= *u;
|
||||
dout(" server_challenge %llx client_challenge %llx key %llx\n",
|
||||
xi->server_challenge, le64_to_cpu(auth->client_challenge),
|
||||
le64_to_cpu(auth->key));
|
||||
|
||||
/* now encode the old ticket if exists */
|
||||
ret = ceph_x_encode_ticket(th, &p, end);
|
||||
if (ret < 0)
|
||||
return ret;
|
||||
|
||||
return p - buf;
|
||||
}
|
||||
|
||||
if (need) {
|
||||
void *p = head + 1;
|
||||
struct ceph_x_service_ticket_request *req;
|
||||
|
||||
if (p > end)
|
||||
return -ERANGE;
|
||||
head->op = cpu_to_le16(CEPHX_GET_PRINCIPAL_SESSION_KEY);
|
||||
|
||||
BUG_ON(!th);
|
||||
ret = ceph_x_build_authorizer(ac, th, &xi->auth_authorizer);
|
||||
if (ret)
|
||||
return ret;
|
||||
ceph_encode_copy(&p, xi->auth_authorizer.buf->vec.iov_base,
|
||||
xi->auth_authorizer.buf->vec.iov_len);
|
||||
|
||||
req = p;
|
||||
req->keys = cpu_to_le32(need);
|
||||
p += sizeof(*req);
|
||||
return p - buf;
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
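In the CEPHX_GET_AUTH_SESSION_KEY branch above, the encrypted challenge blob is reduced to a 64-bit key by XOR-ing its complete 8-byte words. A minimal standalone sketch of that fold follows; `xor_fold64` is a hypothetical helper name, and it copies each word with memcpy instead of casting the buffer to a u64 pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* XOR together every complete 8-byte word of buf; trailing bytes are ignored. */
static uint64_t xor_fold64(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint64_t key = 0, word;

	assert(buf != NULL || len == 0);
	for (; len >= sizeof(word); p += sizeof(word), len -= sizeof(word)) {
		memcpy(&word, p, sizeof(word));	/* safe for unaligned input */
		key ^= word;
	}
	return key;
}
```

Like the loop in ceph_x_build_request, the fold works on words in native byte order and skips any tail shorter than eight bytes.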
|
||||
|
||||
static int ceph_x_handle_reply(struct ceph_auth_client *ac, int result,
|
||||
void *buf, void *end)
|
||||
{
|
||||
struct ceph_x_info *xi = ac->private;
|
||||
struct ceph_x_reply_header *head = buf;
|
||||
struct ceph_x_ticket_handler *th;
|
||||
int len = end - buf;
|
||||
int op;
|
||||
int ret;
|
||||
|
||||
if (result)
|
||||
return result; /* XXX hmm? */
|
||||
|
||||
if (xi->starting) {
|
||||
/* it's a hello */
|
||||
struct ceph_x_server_challenge *sc = buf;
|
||||
|
||||
if (len != sizeof(*sc))
|
||||
return -EINVAL;
|
||||
xi->server_challenge = le64_to_cpu(sc->server_challenge);
|
||||
dout("handle_reply got server challenge %llx\n",
|
||||
xi->server_challenge);
|
||||
xi->starting = false;
|
||||
xi->have_keys &= ~CEPH_ENTITY_TYPE_AUTH;
|
||||
return -EAGAIN;
|
||||
}
|
||||
|
||||
op = le16_to_cpu(head->op);
|
||||
result = le32_to_cpu(head->result);
|
||||
dout("handle_reply op %d result %d\n", op, result);
|
||||
switch (op) {
|
||||
case CEPHX_GET_AUTH_SESSION_KEY:
|
||||
/* verify auth key */
|
||||
ret = ceph_x_proc_ticket_reply(ac, &xi->secret,
|
||||
buf + sizeof(*head), end);
|
||||
break;
|
||||
|
||||
case CEPHX_GET_PRINCIPAL_SESSION_KEY:
|
||||
th = get_ticket_handler(ac, CEPH_ENTITY_TYPE_AUTH);
|
||||
BUG_ON(!th);
|
||||
ret = ceph_x_proc_ticket_reply(ac, &th->session_key,
|
||||
buf + sizeof(*head), end);
|
||||
break;
|
||||
|
||||
default:
|
||||
return -EINVAL;
|
||||
}
|
||||
if (ret)
|
||||
return ret;
|
||||
if (ac->want_keys == xi->have_keys)
|
||||
return 0;
|
||||
return -EAGAIN;
|
||||
}
|
||||
|
||||
static int ceph_x_create_authorizer(
|
||||
struct ceph_auth_client *ac, int peer_type,
|
||||
struct ceph_authorizer **a,
|
||||
void **buf, size_t *len,
|
||||
void **reply_buf, size_t *reply_len)
|
||||
{
|
||||
struct ceph_x_authorizer *au;
|
||||
struct ceph_x_ticket_handler *th;
|
||||
int ret;
|
||||
|
||||
th = get_ticket_handler(ac, peer_type);
|
||||
if (IS_ERR(th))
|
||||
return PTR_ERR(th);
|
||||
|
||||
au = kzalloc(sizeof(*au), GFP_NOFS);
|
||||
if (!au)
|
||||
return -ENOMEM;
|
||||
|
||||
ret = ceph_x_build_authorizer(ac, th, au);
|
||||
if (ret) {
|
||||
kfree(au);
|
||||
return ret;
|
||||
}
|
||||
|
||||
*a = (struct ceph_authorizer *)au;
|
||||
*buf = au->buf->vec.iov_base;
|
||||
*len = au->buf->vec.iov_len;
|
||||
*reply_buf = au->reply_buf;
|
||||
*reply_len = sizeof(au->reply_buf);
|
||||
return 0;
|
||||
}
|
||||
|
||||
static int ceph_x_verify_authorizer_reply(struct ceph_auth_client *ac,
|
||||
struct ceph_authorizer *a, size_t len)
|
||||
{
|
||||
struct ceph_x_authorizer *au = (void *)a;
|
||||
struct ceph_x_ticket_handler *th;
|
||||
int ret = 0;
|
||||
struct ceph_x_authorize_reply reply;
|
||||
void *p = au->reply_buf;
|
||||
void *end = p + sizeof(au->reply_buf);
|
||||
|
||||
th = get_ticket_handler(ac, au->service);
|
||||
if (IS_ERR(th))
|
||||
return -EIO; /* hrm! */
|
||||
ret = ceph_x_decrypt(&th->session_key, &p, end, &reply, sizeof(reply));
|
||||
if (ret < 0)
|
||||
return ret;
|
||||
if (ret != sizeof(reply))
|
||||
return -EPERM;
|
||||
|
||||
if (au->nonce + 1 != le64_to_cpu(reply.nonce_plus_one))
|
||||
ret = -EPERM;
|
||||
else
|
||||
ret = 0;
|
||||
dout("verify_authorizer_reply nonce %llx got %llx ret %d\n",
|
||||
au->nonce, le64_to_cpu(reply.nonce_plus_one), ret);
|
||||
return ret;
|
||||
}
|
||||
|
||||
static void ceph_x_destroy_authorizer(struct ceph_auth_client *ac,
|
||||
struct ceph_authorizer *a)
|
||||
{
|
||||
struct ceph_x_authorizer *au = (void *)a;
|
||||
|
||||
ceph_buffer_put(au->buf);
|
||||
kfree(au);
|
||||
}
|
||||
|
||||
|
||||
static void ceph_x_reset(struct ceph_auth_client *ac)
|
||||
{
|
||||
struct ceph_x_info *xi = ac->private;
|
||||
|
||||
dout("reset\n");
|
||||
xi->starting = true;
|
||||
xi->server_challenge = 0;
|
||||
}
|
||||
|
||||
static void ceph_x_destroy(struct ceph_auth_client *ac)
|
||||
{
|
||||
struct ceph_x_info *xi = ac->private;
|
||||
struct rb_node *p;
|
||||
|
||||
dout("ceph_x_destroy %p\n", ac);
|
||||
ceph_crypto_key_destroy(&xi->secret);
|
||||
|
||||
while ((p = rb_first(&xi->ticket_handlers)) != NULL) {
|
||||
struct ceph_x_ticket_handler *th =
|
||||
rb_entry(p, struct ceph_x_ticket_handler, node);
|
||||
remove_ticket_handler(ac, th);
|
||||
}
|
||||
|
||||
kmem_cache_destroy(ceph_x_ticketbuf_cachep);
|
||||
|
||||
kfree(ac->private);
|
||||
ac->private = NULL;
|
||||
}
|
||||
|
||||
static void ceph_x_invalidate_authorizer(struct ceph_auth_client *ac,
|
||||
int peer_type)
|
||||
{
|
||||
struct ceph_x_ticket_handler *th;
|
||||
|
||||
th = get_ticket_handler(ac, peer_type);
|
||||
if (th && !IS_ERR(th))
|
||||
remove_ticket_handler(ac, th);
|
||||
}
|
||||
|
||||
|
||||
static const struct ceph_auth_client_ops ceph_x_ops = {
|
||||
.is_authenticated = ceph_x_is_authenticated,
|
||||
.build_request = ceph_x_build_request,
|
||||
.handle_reply = ceph_x_handle_reply,
|
||||
.create_authorizer = ceph_x_create_authorizer,
|
||||
.verify_authorizer_reply = ceph_x_verify_authorizer_reply,
|
||||
.destroy_authorizer = ceph_x_destroy_authorizer,
|
||||
.invalidate_authorizer = ceph_x_invalidate_authorizer,
|
||||
.reset = ceph_x_reset,
|
||||
.destroy = ceph_x_destroy,
|
||||
};
|
||||
|
||||
|
||||
int ceph_x_init(struct ceph_auth_client *ac)
|
||||
{
|
||||
struct ceph_x_info *xi;
|
||||
int ret;
|
||||
|
||||
dout("ceph_x_init %p\n", ac);
|
||||
xi = kzalloc(sizeof(*xi), GFP_NOFS);
|
||||
if (!xi)
|
||||
return -ENOMEM;
|
||||
|
||||
ret = -ENOMEM;
|
||||
ceph_x_ticketbuf_cachep = kmem_cache_create("ceph_x_ticketbuf",
|
||||
TEMP_TICKET_BUF_LEN, 8,
|
||||
(SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD),
|
||||
NULL);
|
||||
if (!ceph_x_ticketbuf_cachep)
|
||||
goto done_nomem;
|
||||
ret = -EINVAL;
|
||||
if (!ac->secret) {
|
||||
pr_err("no secret set (for auth_x protocol)\n");
|
||||
goto done_nomem;
|
||||
}
|
||||
|
||||
ret = ceph_crypto_key_unarmor(&xi->secret, ac->secret);
|
||||
if (ret)
|
||||
goto done_nomem;
|
||||
|
||||
xi->starting = true;
|
||||
xi->ticket_handlers = RB_ROOT;
|
||||
|
||||
ac->protocol = CEPH_AUTH_CEPHX;
|
||||
ac->private = xi;
|
||||
ac->ops = &ceph_x_ops;
|
||||
return 0;
|
||||
|
||||
done_nomem:
|
||||
kfree(xi);
|
||||
if (ceph_x_ticketbuf_cachep)
|
||||
kmem_cache_destroy(ceph_x_ticketbuf_cachep);
|
||||
return ret;
|
||||
}
|
||||
|
||||
|
49
fs/ceph/auth_x.h
Normal file
@ -0,0 +1,49 @@
|
||||
#ifndef _FS_CEPH_AUTH_X_H
|
||||
#define _FS_CEPH_AUTH_X_H
|
||||
|
||||
#include <linux/rbtree.h>
|
||||
|
||||
#include "crypto.h"
|
||||
#include "auth.h"
|
||||
#include "auth_x_protocol.h"
|
||||
|
||||
/*
|
||||
* Handle ticket for a single service.
|
||||
*/
|
||||
struct ceph_x_ticket_handler {
|
||||
struct rb_node node;
|
||||
unsigned service;
|
||||
|
||||
struct ceph_crypto_key session_key;
|
||||
struct ceph_timespec validity;
|
||||
|
||||
u64 secret_id;
|
||||
struct ceph_buffer *ticket_blob;
|
||||
|
||||
unsigned long renew_after, expires;
|
||||
};
|
||||
|
||||
|
||||
struct ceph_x_authorizer {
|
||||
struct ceph_buffer *buf;
|
||||
unsigned service;
|
||||
u64 nonce;
|
||||
char reply_buf[128]; /* big enough for encrypted blob */
|
||||
};
|
||||
|
||||
struct ceph_x_info {
|
||||
struct ceph_crypto_key secret;
|
||||
|
||||
bool starting;
|
||||
u64 server_challenge;
|
||||
|
||||
unsigned have_keys;
|
||||
struct rb_root ticket_handlers;
|
||||
|
||||
struct ceph_x_authorizer auth_authorizer;
|
||||
};
|
||||
|
||||
extern int ceph_x_init(struct ceph_auth_client *ac);
|
||||
|
||||
#endif
|
||||
|
90
fs/ceph/auth_x_protocol.h
Normal file
@ -0,0 +1,90 @@
|
||||
#ifndef __FS_CEPH_AUTH_X_PROTOCOL
|
||||
#define __FS_CEPH_AUTH_X_PROTOCOL
|
||||
|
||||
#define CEPHX_GET_AUTH_SESSION_KEY 0x0100
|
||||
#define CEPHX_GET_PRINCIPAL_SESSION_KEY 0x0200
|
||||
#define CEPHX_GET_ROTATING_KEY 0x0400
|
||||
|
||||
/* common bits */
|
||||
struct ceph_x_ticket_blob {
|
||||
__u8 struct_v;
|
||||
__le64 secret_id;
|
||||
__le32 blob_len;
|
||||
char blob[];
|
||||
} __attribute__ ((packed));
|
||||
|
||||
|
||||
/* common request/reply headers */
|
||||
struct ceph_x_request_header {
|
||||
__le16 op;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
struct ceph_x_reply_header {
|
||||
__le16 op;
|
||||
__le32 result;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
|
||||
/* authenticate handshake */
|
||||
|
||||
/* initial hello (no reply header) */
|
||||
struct ceph_x_server_challenge {
|
||||
__u8 struct_v;
|
||||
__le64 server_challenge;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
struct ceph_x_authenticate {
|
||||
__u8 struct_v;
|
||||
__le64 client_challenge;
|
||||
__le64 key;
|
||||
/* ticket blob */
|
||||
} __attribute__ ((packed));
|
||||
|
||||
struct ceph_x_service_ticket_request {
|
||||
__u8 struct_v;
|
||||
__le32 keys;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
struct ceph_x_challenge_blob {
|
||||
__le64 server_challenge;
|
||||
__le64 client_challenge;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
|
||||
|
||||
/* authorize handshake */
|
||||
|
||||
/*
|
||||
* The authorizer consists of two pieces:
|
||||
* a - service id, ticket blob
|
||||
* b - encrypted with session key
|
||||
*/
|
||||
struct ceph_x_authorize_a {
|
||||
__u8 struct_v;
|
||||
__le64 global_id;
|
||||
__le32 service_id;
|
||||
struct ceph_x_ticket_blob ticket_blob;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
struct ceph_x_authorize_b {
|
||||
__u8 struct_v;
|
||||
__le64 nonce;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
struct ceph_x_authorize_reply {
|
||||
__u8 struct_v;
|
||||
__le64 nonce_plus_one;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
|
||||
/*
|
||||
* encryption bundle
|
||||
*/
|
||||
#define CEPHX_ENC_MAGIC 0xff009cad8826aa55ull
|
||||
|
||||
struct ceph_x_encrypt_header {
|
||||
__u8 struct_v;
|
||||
__le64 magic;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
#endif
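Every wire structure in this header is declared with `__attribute__ ((packed))`, so its in-memory layout matches the byte stream exactly. The sketch below, assuming GCC or Clang, shows the effect on a copy of the reply header; the struct names are hypothetical and fixed-width integers stand in for `__le16`/`__le32`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same field order as ceph_x_reply_header, with padding suppressed. */
struct reply_header_packed {
	uint16_t op;
	uint32_t result;
} __attribute__ ((packed));

/* Without the attribute the compiler may pad after op to align result. */
struct reply_header_natural {
	uint16_t op;
	uint32_t result;
};

/* Compile-time checks: 2 + 4 bytes, with result directly after op. */
_Static_assert(sizeof(struct reply_header_packed) == 6,
	       "packed header must be exactly 6 bytes");
_Static_assert(offsetof(struct reply_header_packed, result) == 2,
	       "result must follow op without padding");
```

On common ABIs the unpacked variant occupies 8 bytes, which would shift every field that follows it in a message.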
|
78
fs/ceph/buffer.c
Normal file
@ -0,0 +1,78 @@
|
||||
|
||||
#include "ceph_debug.h"
|
||||
#include "buffer.h"
|
||||
#include "decode.h"
|
||||
|
||||
struct ceph_buffer *ceph_buffer_new(size_t len, gfp_t gfp)
|
||||
{
|
||||
struct ceph_buffer *b;
|
||||
|
||||
b = kmalloc(sizeof(*b), gfp);
|
||||
if (!b)
|
||||
return NULL;
|
||||
|
||||
b->vec.iov_base = kmalloc(len, gfp | __GFP_NOWARN);
|
||||
if (b->vec.iov_base) {
|
||||
b->is_vmalloc = false;
|
||||
} else {
|
||||
b->vec.iov_base = __vmalloc(len, gfp, PAGE_KERNEL);
|
||||
if (!b->vec.iov_base) {
|
||||
kfree(b);
|
||||
return NULL;
|
||||
}
|
||||
b->is_vmalloc = true;
|
||||
}
|
||||
|
||||
kref_init(&b->kref);
|
||||
b->alloc_len = len;
|
||||
b->vec.iov_len = len;
|
||||
dout("buffer_new %p\n", b);
|
||||
return b;
|
||||
}
|
||||
|
||||
void ceph_buffer_release(struct kref *kref)
|
||||
{
|
||||
struct ceph_buffer *b = container_of(kref, struct ceph_buffer, kref);
|
||||
|
||||
dout("buffer_release %p\n", b);
|
||||
if (b->vec.iov_base) {
|
||||
if (b->is_vmalloc)
|
||||
vfree(b->vec.iov_base);
|
||||
else
|
||||
kfree(b->vec.iov_base);
|
||||
}
|
||||
kfree(b);
|
||||
}
|
||||
|
||||
int ceph_buffer_alloc(struct ceph_buffer *b, int len, gfp_t gfp)
|
||||
{
|
||||
b->vec.iov_base = kmalloc(len, gfp | __GFP_NOWARN);
|
||||
if (b->vec.iov_base) {
|
||||
b->is_vmalloc = false;
|
||||
} else {
|
||||
b->vec.iov_base = __vmalloc(len, gfp, PAGE_KERNEL);
|
||||
b->is_vmalloc = true;
|
||||
}
|
||||
if (!b->vec.iov_base)
|
||||
return -ENOMEM;
|
||||
b->alloc_len = len;
|
||||
b->vec.iov_len = len;
|
||||
return 0;
|
||||
}
|
||||
|
||||
int ceph_decode_buffer(struct ceph_buffer **b, void **p, void *end)
|
||||
{
|
||||
size_t len;
|
||||
|
||||
ceph_decode_need(p, end, sizeof(u32), bad);
|
||||
len = ceph_decode_32(p);
|
||||
dout("decode_buffer len %d\n", (int)len);
|
||||
ceph_decode_need(p, end, len, bad);
|
||||
*b = ceph_buffer_new(len, GFP_NOFS);
|
||||
if (!*b)
|
||||
return -ENOMEM;
|
||||
ceph_decode_copy(p, (*b)->vec.iov_base, len);
|
||||
return 0;
|
||||
bad:
|
||||
return -EINVAL;
|
||||
}
|
39
fs/ceph/buffer.h
Normal file
@ -0,0 +1,39 @@
|
||||
#ifndef __FS_CEPH_BUFFER_H
|
||||
#define __FS_CEPH_BUFFER_H
|
||||
|
||||
#include <linux/kref.h>
|
||||
#include <linux/mm.h>
|
||||
#include <linux/vmalloc.h>
|
||||
#include <linux/types.h>
|
||||
#include <linux/uio.h>
|
||||
|
||||
/*
|
||||
* a simple reference counted buffer.
|
||||
*
|
||||
* try kmalloc first; fall back to vmalloc if that fails, e.g. for larger
|
||||
* sizes.
|
||||
*/
|
||||
struct ceph_buffer {
|
||||
struct kref kref;
|
||||
struct kvec vec;
|
||||
size_t alloc_len;
|
||||
bool is_vmalloc;
|
||||
};
|
||||
|
||||
extern struct ceph_buffer *ceph_buffer_new(size_t len, gfp_t gfp);
|
||||
extern void ceph_buffer_release(struct kref *kref);
|
||||
|
||||
static inline struct ceph_buffer *ceph_buffer_get(struct ceph_buffer *b)
|
||||
{
|
||||
kref_get(&b->kref);
|
||||
return b;
|
||||
}
|
||||
|
||||
static inline void ceph_buffer_put(struct ceph_buffer *b)
|
||||
{
|
||||
kref_put(&b->kref, ceph_buffer_release);
|
||||
}
|
||||
|
||||
extern int ceph_decode_buffer(struct ceph_buffer **b, void **p, void *end);
|
||||
|
||||
#endif
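The get/put discipline of ceph_buffer can be tried outside the kernel. This is a rough userspace analogue, assuming a plain integer counter in place of `struct kref` and malloc in place of kmalloc/vmalloc; it is not thread-safe, and all names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

struct buf {
	int refs;		/* plain counter; the kernel code uses struct kref */
	size_t len;
	unsigned char *data;
};

/* Allocate a buffer holding one reference, or return NULL on failure. */
static struct buf *buf_new(size_t len)
{
	struct buf *b = malloc(sizeof(*b));

	if (!b)
		return NULL;
	b->data = malloc(len ? len : 1);
	if (!b->data) {
		free(b);
		return NULL;
	}
	b->refs = 1;
	b->len = len;
	return b;
}

/* Take an additional reference. */
static struct buf *buf_get(struct buf *b)
{
	b->refs++;
	return b;
}

/* Drop one reference; returns 1 if this call freed the buffer. */
static int buf_put(struct buf *b)
{
	assert(b->refs > 0);
	if (--b->refs)
		return 0;
	free(b->data);
	free(b);
	return 1;
}
```

Each holder calls `buf_put` exactly once for every `buf_new` or `buf_get` it performed; the last put releases the memory.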
|
2927
fs/ceph/caps.c
Normal file
File diff suppressed because it is too large
37
fs/ceph/ceph_debug.h
Normal file
@ -0,0 +1,37 @@
|
||||
#ifndef _FS_CEPH_DEBUG_H
|
||||
#define _FS_CEPH_DEBUG_H
|
||||
|
||||
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
|
||||
|
||||
#ifdef CONFIG_CEPH_FS_PRETTYDEBUG
|
||||
|
||||
/*
|
||||
* wrap pr_debug to include a filename:lineno prefix on each line.
|
||||
* this incurs some overhead (kernel size and execution time) due to
|
||||
* the extra function call at each call site.
|
||||
*/
|
||||
|
||||
# if defined(DEBUG) || defined(CONFIG_DYNAMIC_DEBUG)
|
||||
extern const char *ceph_file_part(const char *s, int len);
|
||||
# define dout(fmt, ...) \
|
||||
pr_debug(" %12.12s:%-4d : " fmt, \
|
||||
ceph_file_part(__FILE__, sizeof(__FILE__)), \
|
||||
__LINE__, ##__VA_ARGS__)
|
||||
# else
|
||||
/* faux printk call just to see any compiler warnings. */
|
||||
# define dout(fmt, ...) do { \
|
||||
if (0) \
|
||||
printk(KERN_DEBUG fmt, ##__VA_ARGS__); \
|
||||
} while (0)
|
||||
# endif
|
||||
|
||||
#else
|
||||
|
||||
/*
|
||||
* or, just wrap pr_debug
|
||||
*/
|
||||
# define dout(fmt, ...) pr_debug(" " fmt, ##__VA_ARGS__)
|
||||
|
||||
#endif
|
||||
|
||||
#endif
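The "pretty" variant of dout prefixes each message with a file name and line number. A userspace approximation is sketched here, assuming fprintf in place of pr_debug; `file_part` is a hypothetical simplification of ceph_file_part (which takes an extra length argument), and `##__VA_ARGS__` is a GNU extension supported by GCC and Clang.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return the part of path after the last '/', or path itself if none. */
static const char *file_part(const char *path)
{
	const char *slash;

	assert(path != NULL);
	slash = strrchr(path, '/');
	return slash ? slash + 1 : path;
}

/* Debug print with a "file:line" prefix, written to stderr. */
#define dout(fmt, ...) \
	fprintf(stderr, " %12.12s:%-4d : " fmt, \
		file_part(__FILE__), __LINE__, ##__VA_ARGS__)
```

A call such as `dout("need len %d\n", len);` then prints the message behind a fixed-width source location, which keeps log columns aligned.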
|
21
fs/ceph/ceph_frag.c
Normal file
@ -0,0 +1,21 @@
|
||||
/*
|
||||
* Ceph 'frag' type
|
||||
*/
|
||||
#include "types.h"
|
||||
|
||||
int ceph_frag_compare(__u32 a, __u32 b)
|
||||
{
|
||||
unsigned va = ceph_frag_value(a);
|
||||
unsigned vb = ceph_frag_value(b);
|
||||
if (va < vb)
|
||||
return -1;
|
||||
if (va > vb)
|
||||
return 1;
|
||||
va = ceph_frag_bits(a);
|
||||
vb = ceph_frag_bits(b);
|
||||
if (va < vb)
|
||||
return -1;
|
||||
if (va > vb)
|
||||
return 1;
|
||||
return 0;
|
||||
}
|
109
fs/ceph/ceph_frag.h
Normal file
@ -0,0 +1,109 @@
|
||||
#ifndef _FS_CEPH_FRAG_H
|
||||
#define _FS_CEPH_FRAG_H
|
||||
|
||||
/*
|
||||
* "Frags" are a way to describe a subset of a 32-bit number space,
|
||||
* using a mask and a value to match against that mask. Any given frag
|
||||
* (subset of the number space) can be partitioned into 2^n sub-frags.
|
||||
*
|
||||
* Frags are encoded into a 32-bit word:
|
||||
* 8 upper bits = "bits"
|
||||
* 24 lower bits = "value"
|
||||
* (We could go to 5+27 bits, but who cares.)
|
||||
*
|
||||
* We use the _most_ significant bits of the 24 bit value. This makes
|
||||
* values logically sort.
|
||||
*
|
||||
* Unfortunately, because the "bits" field is still in the high bits, we
|
||||
* can't sort encoded frags numerically. However, it does allow you
|
||||
* to feed encoded frags as values into frag_contains_value.
|
||||
*/
|
||||
static inline __u32 ceph_frag_make(__u32 b, __u32 v)
|
||||
{
|
||||
return (b << 24) |
|
||||
(v & (0xffffffu << (24-b)) & 0xffffffu);
|
||||
}
|
||||
static inline __u32 ceph_frag_bits(__u32 f)
|
||||
{
|
||||
return f >> 24;
|
||||
}
|
||||
static inline __u32 ceph_frag_value(__u32 f)
|
||||
{
|
||||
return f & 0xffffffu;
|
||||
}
|
||||
static inline __u32 ceph_frag_mask(__u32 f)
|
||||
{
|
||||
return (0xffffffu << (24-ceph_frag_bits(f))) & 0xffffffu;
|
||||
}
|
||||
static inline __u32 ceph_frag_mask_shift(__u32 f)
|
||||
{
|
||||
return 24 - ceph_frag_bits(f);
|
||||
}
|
||||
|
||||
static inline int ceph_frag_contains_value(__u32 f, __u32 v)
|
||||
{
|
||||
return (v & ceph_frag_mask(f)) == ceph_frag_value(f);
|
||||
}
|
||||
static inline int ceph_frag_contains_frag(__u32 f, __u32 sub)
|
||||
{
|
||||
/* is sub as specific as us, and contained by us? */
|
||||
return ceph_frag_bits(sub) >= ceph_frag_bits(f) &&
|
||||
(ceph_frag_value(sub) & ceph_frag_mask(f)) == ceph_frag_value(f);
|
||||
}
|
||||
|
||||
static inline __u32 ceph_frag_parent(__u32 f)
|
||||
{
|
||||
return ceph_frag_make(ceph_frag_bits(f) - 1,
|
||||
ceph_frag_value(f) & (ceph_frag_mask(f) << 1));
|
||||
}
|
||||
static inline int ceph_frag_is_left_child(__u32 f)
|
||||
{
|
||||
return ceph_frag_bits(f) > 0 &&
|
||||
(ceph_frag_value(f) & (0x1000000 >> ceph_frag_bits(f))) == 0;
|
||||
}
|
||||
static inline int ceph_frag_is_right_child(__u32 f)
|
||||
{
|
||||
return ceph_frag_bits(f) > 0 &&
|
||||
(ceph_frag_value(f) & (0x1000000 >> ceph_frag_bits(f))) != 0;
|
||||
}
|
||||
static inline __u32 ceph_frag_sibling(__u32 f)
|
||||
{
|
||||
return ceph_frag_make(ceph_frag_bits(f),
|
||||
ceph_frag_value(f) ^ (0x1000000 >> ceph_frag_bits(f)));
|
||||
}
|
||||
static inline __u32 ceph_frag_left_child(__u32 f)
|
||||
{
|
||||
return ceph_frag_make(ceph_frag_bits(f)+1, ceph_frag_value(f));
|
||||
}
|
||||
static inline __u32 ceph_frag_right_child(__u32 f)
|
||||
{
|
||||
return ceph_frag_make(ceph_frag_bits(f)+1,
|
||||
ceph_frag_value(f) | (0x1000000 >> (1+ceph_frag_bits(f))));
|
||||
}
|
||||
static inline __u32 ceph_frag_make_child(__u32 f, int by, int i)
|
||||
{
|
||||
int newbits = ceph_frag_bits(f) + by;
|
||||
return ceph_frag_make(newbits,
|
||||
ceph_frag_value(f) | (i << (24 - newbits)));
|
||||
}
|
||||
static inline int ceph_frag_is_leftmost(__u32 f)
|
||||
{
|
||||
return ceph_frag_value(f) == 0;
|
||||
}
|
||||
static inline int ceph_frag_is_rightmost(__u32 f)
|
||||
{
|
||||
return ceph_frag_value(f) == ceph_frag_mask(f);
|
||||
}
|
||||
static inline __u32 ceph_frag_next(__u32 f)
|
||||
{
|
||||
return ceph_frag_make(ceph_frag_bits(f),
|
||||
ceph_frag_value(f) + (0x1000000 >> ceph_frag_bits(f)));
|
||||
}
|
||||
|
||||
/*
|
||||
* comparator to sort frags logically, as when traversing the
|
||||
* number space in ascending order...
|
||||
*/
|
||||
int ceph_frag_compare(__u32 a, __u32 b);
|
||||
|
||||
#endif
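A short worked example may make the frag encoding more concrete. The helpers below restate ceph_frag_make, ceph_frag_bits, ceph_frag_value, ceph_frag_mask and ceph_frag_contains_value with `uint32_t` so they compile in userspace; the shortened names are only for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Pack "bits" into the top byte and keep only the top bits of value. */
static uint32_t frag_make(uint32_t bits, uint32_t value)
{
	assert(bits <= 24);
	return (bits << 24) |
	       (value & (0xffffffu << (24 - bits)) & 0xffffffu);
}

static uint32_t frag_bits(uint32_t f)
{
	return f >> 24;
}

static uint32_t frag_value(uint32_t f)
{
	return f & 0xffffffu;
}

/* Mask selecting the significant (most significant) value bits. */
static uint32_t frag_mask(uint32_t f)
{
	return (0xffffffu << (24 - frag_bits(f))) & 0xffffffu;
}

static int frag_contains_value(uint32_t f, uint32_t v)
{
	return (v & frag_mask(f)) == frag_value(f);
}
```

For instance, `frag_make(1, 0x800000)` encodes as `0x01800000` and covers the upper half of the 24-bit space: it contains `0xabcdef` but not `0x123456`. The root frag `frag_make(0, 0)` has an empty mask and therefore contains every value.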
|
74
fs/ceph/ceph_fs.c
Normal file
@ -0,0 +1,74 @@
|
||||
/*
|
||||
* Some non-inline ceph helpers
|
||||
*/
|
||||
#include "types.h"
|
||||
|
||||
/*
|
||||
* return true if @layout appears to be valid
|
||||
*/
|
||||
int ceph_file_layout_is_valid(const struct ceph_file_layout *layout)
|
||||
{
|
||||
__u32 su = le32_to_cpu(layout->fl_stripe_unit);
|
||||
__u32 sc = le32_to_cpu(layout->fl_stripe_count);
|
||||
__u32 os = le32_to_cpu(layout->fl_object_size);
|
||||
|
||||
/* stripe unit, object size must be non-zero, 64k increment */
|
||||
if (!su || (su & (CEPH_MIN_STRIPE_UNIT-1)))
|
||||
return 0;
|
||||
if (!os || (os & (CEPH_MIN_STRIPE_UNIT-1)))
|
||||
return 0;
|
||||
/* object size must be a multiple of stripe unit */
|
||||
if (os < su || os % su)
|
||||
return 0;
|
||||
/* stripe count must be non-zero */
|
||||
if (!sc)
|
||||
return 0;
|
||||
return 1;
|
||||
}
|
||||
|
||||
|
||||
int ceph_flags_to_mode(int flags)
|
||||
{
|
||||
#ifdef O_DIRECTORY /* fixme */
|
||||
if ((flags & O_DIRECTORY) == O_DIRECTORY)
|
||||
return CEPH_FILE_MODE_PIN;
|
||||
#endif
|
||||
#ifdef O_LAZY
|
||||
if (flags & O_LAZY)
|
||||
return CEPH_FILE_MODE_LAZY;
|
||||
#endif
|
||||
if ((flags & O_APPEND) == O_APPEND)
|
||||
flags |= O_WRONLY;
|
||||
|
||||
flags &= O_ACCMODE;
|
||||
if ((flags & O_RDWR) == O_RDWR)
|
||||
return CEPH_FILE_MODE_RDWR;
|
||||
if ((flags & O_WRONLY) == O_WRONLY)
|
||||
return CEPH_FILE_MODE_WR;
|
||||
return CEPH_FILE_MODE_RD;
|
||||
}
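The access-mode part of ceph_flags_to_mode can be checked in userspace. The sketch below leaves out the O_DIRECTORY and O_LAZY branches, uses a hypothetical `enum file_mode` in place of the CEPH_FILE_MODE_* constants (their values are not shown in this part of the diff), and assumes the conventional encoding O_RDONLY = 0, O_WRONLY = 1, O_RDWR = 2.

```c
#include <assert.h>
#include <fcntl.h>

/* Hypothetical stand-ins for CEPH_FILE_MODE_RD/WR/RDWR. */
enum file_mode { MODE_RD, MODE_WR, MODE_RDWR };

_Static_assert(O_RDONLY == 0 && O_WRONLY == 1 && O_RDWR == 2,
	       "sketch assumes the conventional access-mode encoding");

/* Map open(2) flags to a mode, treating O_APPEND as implying write access. */
static enum file_mode flags_to_mode(int flags)
{
	if ((flags & O_APPEND) == O_APPEND)
		flags |= O_WRONLY;

	flags &= O_ACCMODE;
	if ((flags & O_RDWR) == O_RDWR)
		return MODE_RDWR;
	if ((flags & O_WRONLY) == O_WRONLY)
		return MODE_WR;
	return MODE_RD;
}
```

Note the effect of the O_APPEND rule: `O_RDONLY | O_APPEND` maps to the write mode, since appending needs write capabilities.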
|
||||
|
||||
int ceph_caps_for_mode(int mode)
|
||||
{
|
||||
switch (mode) {
|
||||
case CEPH_FILE_MODE_PIN:
|
||||
return CEPH_CAP_PIN;
|
||||
case CEPH_FILE_MODE_RD:
|
||||
return CEPH_CAP_PIN | CEPH_CAP_FILE_SHARED |
|
||||
CEPH_CAP_FILE_RD | CEPH_CAP_FILE_CACHE;
|
||||
case CEPH_FILE_MODE_RDWR:
|
||||
return CEPH_CAP_PIN | CEPH_CAP_FILE_SHARED |
|
||||
CEPH_CAP_FILE_EXCL |
|
||||
CEPH_CAP_FILE_RD | CEPH_CAP_FILE_CACHE |
|
||||
CEPH_CAP_FILE_WR | CEPH_CAP_FILE_BUFFER |
|
||||
CEPH_CAP_AUTH_SHARED | CEPH_CAP_AUTH_EXCL |
|
||||
CEPH_CAP_XATTR_SHARED | CEPH_CAP_XATTR_EXCL;
|
||||
case CEPH_FILE_MODE_WR:
|
||||
return CEPH_CAP_PIN | CEPH_CAP_FILE_SHARED |
|
||||
CEPH_CAP_FILE_EXCL |
|
||||
CEPH_CAP_FILE_WR | CEPH_CAP_FILE_BUFFER |
|
||||
CEPH_CAP_AUTH_SHARED | CEPH_CAP_AUTH_EXCL |
|
||||
CEPH_CAP_XATTR_SHARED | CEPH_CAP_XATTR_EXCL;
|
||||
}
|
||||
return 0;
|
||||
}
|
650
fs/ceph/ceph_fs.h
Normal file
@ -0,0 +1,650 @@
|
||||
/*
|
||||
* ceph_fs.h - Ceph constants and data types to share between kernel and
|
||||
* user space.
|
||||
*
|
||||
* Most types in this file are defined as little-endian, and are
|
||||
* primarily intended to describe data structures that pass over the
|
||||
* wire or that are stored on disk.
|
||||
*
|
||||
* LGPL2
|
||||
*/
|
||||
|
||||
#ifndef _FS_CEPH_CEPH_FS_H
|
||||
#define _FS_CEPH_CEPH_FS_H
|
||||
|
||||
#include "msgr.h"
|
||||
#include "rados.h"
|
||||
|
||||
/*
|
||||
* Ceph release version
|
||||
*/
|
||||
#define CEPH_VERSION_MAJOR 0
|
||||
#define CEPH_VERSION_MINOR 19
|
||||
#define CEPH_VERSION_PATCH 0
|
||||
|
||||
#define _CEPH_STRINGIFY(x) #x
|
||||
#define CEPH_STRINGIFY(x) _CEPH_STRINGIFY(x)
|
||||
#define CEPH_MAKE_VERSION(x, y, z) CEPH_STRINGIFY(x) "." CEPH_STRINGIFY(y) \
|
||||
"." CEPH_STRINGIFY(z)
|
||||
#define CEPH_VERSION CEPH_MAKE_VERSION(CEPH_VERSION_MAJOR, \
|
||||
CEPH_VERSION_MINOR, CEPH_VERSION_PATCH)
|
||||
|
||||
/*
 * subprotocol versions.  when specific message types or high-level
 * protocols change, bump the affected components.  we rev
 * internal cluster protocols separately from the public,
 * client-facing protocol.
 */
|
||||
#define CEPH_OSD_PROTOCOL 8 /* cluster internal */
|
||||
#define CEPH_MDS_PROTOCOL 9 /* cluster internal */
|
||||
#define CEPH_MON_PROTOCOL 5 /* cluster internal */
|
||||
#define CEPH_OSDC_PROTOCOL 24 /* server/client */
|
||||
#define CEPH_MDSC_PROTOCOL 32 /* server/client */
|
||||
#define CEPH_MONC_PROTOCOL 15 /* server/client */
|
||||
|
||||
|
||||
#define CEPH_INO_ROOT 1
|
||||
#define CEPH_INO_CEPH 2 /* hidden .ceph dir */
|
||||
|
||||
/* arbitrary limit on max # of monitors (cluster of 3 is typical) */
|
||||
#define CEPH_MAX_MON 31
|
||||
|
||||
|
||||
/*
|
||||
* feature bits
|
||||
*/
|
||||
#define CEPH_FEATURE_SUPPORTED 0
|
||||
#define CEPH_FEATURE_REQUIRED 0
|
||||
|
||||
|
||||
/*
|
||||
* ceph_file_layout - describe data layout for a file/inode
|
||||
*/
|
||||
struct ceph_file_layout {
|
||||
/* file -> object mapping */
|
||||
__le32 fl_stripe_unit; /* stripe unit, in bytes. must be multiple
|
||||
of page size. */
|
||||
__le32 fl_stripe_count; /* over this many objects */
|
||||
__le32 fl_object_size; /* until objects are this big, then move to
|
||||
new objects */
|
||||
__le32 fl_cas_hash; /* 0 = none; 1 = sha256 */
|
||||
|
||||
/* pg -> disk layout */
|
||||
__le32 fl_object_stripe_unit; /* for per-object parity, if any */
|
||||
|
||||
/* object -> pg layout */
|
||||
__le32 fl_pg_preferred; /* preferred primary for pg (-1 for none) */
|
||||
__le32 fl_pg_pool; /* namespace, crush ruleset, rep level */
|
||||
} __attribute__ ((packed));
|
||||
|
||||
#define CEPH_MIN_STRIPE_UNIT 65536
|
||||
|
||||
int ceph_file_layout_is_valid(const struct ceph_file_layout *layout);
|
||||
|
||||
|
||||
/* crypto algorithms */
|
||||
#define CEPH_CRYPTO_NONE 0x0
|
||||
#define CEPH_CRYPTO_AES 0x1
|
||||
|
||||
/* security/authentication protocols */
|
||||
#define CEPH_AUTH_UNKNOWN 0x0
|
||||
#define CEPH_AUTH_NONE 0x1
|
||||
#define CEPH_AUTH_CEPHX 0x2
|
||||
|
||||
|
||||
/*********************************************
|
||||
* message layer
|
||||
*/
|
||||
|
||||
/*
|
||||
* message types
|
||||
*/
|
||||
|
||||
/* misc */
|
||||
#define CEPH_MSG_SHUTDOWN 1
|
||||
#define CEPH_MSG_PING 2
|
||||
|
||||
/* client <-> monitor */
|
||||
#define CEPH_MSG_MON_MAP 4
|
||||
#define CEPH_MSG_MON_GET_MAP 5
|
||||
#define CEPH_MSG_STATFS 13
|
||||
#define CEPH_MSG_STATFS_REPLY 14
|
||||
#define CEPH_MSG_MON_SUBSCRIBE 15
|
||||
#define CEPH_MSG_MON_SUBSCRIBE_ACK 16
|
||||
#define CEPH_MSG_AUTH 17
|
||||
#define CEPH_MSG_AUTH_REPLY 18
|
||||
|
||||
/* client <-> mds */
|
||||
#define CEPH_MSG_MDS_MAP 21
|
||||
|
||||
#define CEPH_MSG_CLIENT_SESSION 22
|
||||
#define CEPH_MSG_CLIENT_RECONNECT 23
|
||||
|
||||
#define CEPH_MSG_CLIENT_REQUEST 24
|
||||
#define CEPH_MSG_CLIENT_REQUEST_FORWARD 25
|
||||
#define CEPH_MSG_CLIENT_REPLY 26
|
||||
#define CEPH_MSG_CLIENT_CAPS 0x310
|
||||
#define CEPH_MSG_CLIENT_LEASE 0x311
|
||||
#define CEPH_MSG_CLIENT_SNAP 0x312
|
||||
#define CEPH_MSG_CLIENT_CAPRELEASE 0x313
|
||||
|
||||
/* osd */
|
||||
#define CEPH_MSG_OSD_MAP 41
|
||||
#define CEPH_MSG_OSD_OP 42
|
||||
#define CEPH_MSG_OSD_OPREPLY 43
|
||||
|
||||
struct ceph_mon_request_header {
|
||||
__le64 have_version;
|
||||
__le16 session_mon;
|
||||
__le64 session_mon_tid;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
struct ceph_mon_statfs {
|
||||
struct ceph_mon_request_header monhdr;
|
||||
struct ceph_fsid fsid;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
struct ceph_statfs {
|
||||
__le64 kb, kb_used, kb_avail;
|
||||
__le64 num_objects;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
struct ceph_mon_statfs_reply {
|
||||
struct ceph_fsid fsid;
|
||||
__le64 version;
|
||||
struct ceph_statfs st;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
struct ceph_osd_getmap {
|
||||
struct ceph_mon_request_header monhdr;
|
||||
struct ceph_fsid fsid;
|
||||
__le32 start;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
struct ceph_mds_getmap {
|
||||
struct ceph_mon_request_header monhdr;
|
||||
struct ceph_fsid fsid;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
struct ceph_client_mount {
|
||||
struct ceph_mon_request_header monhdr;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
struct ceph_mon_subscribe_item {
|
||||
__le64 have_version;
__le64 have;
|
||||
__u8 onetime;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
struct ceph_mon_subscribe_ack {
|
||||
__le32 duration; /* seconds */
|
||||
struct ceph_fsid fsid;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
/*
|
||||
* mds states
|
||||
* > 0 -> in
|
||||
* <= 0 -> out
|
||||
*/
|
||||
#define CEPH_MDS_STATE_DNE 0 /* down, does not exist. */
|
||||
#define CEPH_MDS_STATE_STOPPED -1 /* down, once existed, but no subtrees.
|
||||
empty log. */
|
||||
#define CEPH_MDS_STATE_BOOT -4 /* up, boot announcement. */
|
||||
#define CEPH_MDS_STATE_STANDBY -5 /* up, idle. waiting for assignment. */
|
||||
#define CEPH_MDS_STATE_CREATING -6 /* up, creating MDS instance. */
|
||||
#define CEPH_MDS_STATE_STARTING -7 /* up, starting previously stopped mds */
|
||||
#define CEPH_MDS_STATE_STANDBY_REPLAY -8 /* up, tailing active node's journal */
|
||||
|
||||
#define CEPH_MDS_STATE_REPLAY 8 /* up, replaying journal. */
|
||||
#define CEPH_MDS_STATE_RESOLVE 9 /* up, disambiguating distributed
|
||||
operations (import, rename, etc.) */
|
||||
#define CEPH_MDS_STATE_RECONNECT 10 /* up, reconnect to clients */
|
||||
#define CEPH_MDS_STATE_REJOIN 11 /* up, rejoining distributed cache */
|
||||
#define CEPH_MDS_STATE_CLIENTREPLAY 12 /* up, replaying client operations */
|
||||
#define CEPH_MDS_STATE_ACTIVE 13 /* up, active */
|
||||
#define CEPH_MDS_STATE_STOPPING 14 /* up, but exporting metadata */
|
||||
|
||||
extern const char *ceph_mds_state_name(int s);
|
||||
|
||||
|
||||
/*
|
||||
* metadata lock types.
|
||||
* - these are bitmasks.. we can compose them
|
||||
* - they also define the lock ordering by the MDS
|
||||
* - a few of these are internal to the mds
|
||||
*/
|
||||
#define CEPH_LOCK_DN 1
|
||||
#define CEPH_LOCK_ISNAP 2
|
||||
#define CEPH_LOCK_IVERSION 4 /* mds internal */
|
||||
#define CEPH_LOCK_IFILE 8 /* mds internal */
|
||||
#define CEPH_LOCK_IAUTH 32
|
||||
#define CEPH_LOCK_ILINK 64
|
||||
#define CEPH_LOCK_IDFT 128 /* dir frag tree */
|
||||
#define CEPH_LOCK_INEST 256 /* mds internal */
|
||||
#define CEPH_LOCK_IXATTR 512
|
||||
#define CEPH_LOCK_INO 2048 /* immutable inode bits; not a lock */
|
||||
|
||||
/* client_session ops */
|
||||
enum {
|
||||
CEPH_SESSION_REQUEST_OPEN,
|
||||
CEPH_SESSION_OPEN,
|
||||
CEPH_SESSION_REQUEST_CLOSE,
|
||||
CEPH_SESSION_CLOSE,
|
||||
CEPH_SESSION_REQUEST_RENEWCAPS,
|
||||
CEPH_SESSION_RENEWCAPS,
|
||||
CEPH_SESSION_STALE,
|
||||
CEPH_SESSION_RECALL_STATE,
|
||||
};
|
||||
|
||||
extern const char *ceph_session_op_name(int op);
|
||||
|
||||
struct ceph_mds_session_head {
|
||||
__le32 op;
|
||||
__le64 seq;
|
||||
struct ceph_timespec stamp;
|
||||
__le32 max_caps, max_leases;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
/* client_request */
|
||||
/*
 * metadata ops.
 *  & 0x001000 -> write op
 *  & 0x010000 -> follow symlink (e.g. stat(), not lstat()).
 *  & 0x100000 -> use weird ino/path trace
 */
|
||||
#define CEPH_MDS_OP_WRITE 0x001000
|
||||
enum {
|
||||
CEPH_MDS_OP_LOOKUP = 0x00100,
|
||||
CEPH_MDS_OP_GETATTR = 0x00101,
|
||||
CEPH_MDS_OP_LOOKUPHASH = 0x00102,
|
||||
CEPH_MDS_OP_LOOKUPPARENT = 0x00103,
|
||||
|
||||
CEPH_MDS_OP_SETXATTR = 0x01105,
|
||||
CEPH_MDS_OP_RMXATTR = 0x01106,
|
||||
CEPH_MDS_OP_SETLAYOUT = 0x01107,
|
||||
CEPH_MDS_OP_SETATTR = 0x01108,
|
||||
|
||||
CEPH_MDS_OP_MKNOD = 0x01201,
|
||||
CEPH_MDS_OP_LINK = 0x01202,
|
||||
CEPH_MDS_OP_UNLINK = 0x01203,
|
||||
CEPH_MDS_OP_RENAME = 0x01204,
|
||||
CEPH_MDS_OP_MKDIR = 0x01220,
|
||||
CEPH_MDS_OP_RMDIR = 0x01221,
|
||||
CEPH_MDS_OP_SYMLINK = 0x01222,
|
||||
|
||||
CEPH_MDS_OP_CREATE = 0x01301,
|
||||
CEPH_MDS_OP_OPEN = 0x00302,
|
||||
CEPH_MDS_OP_READDIR = 0x00305,
|
||||
|
||||
CEPH_MDS_OP_LOOKUPSNAP = 0x00400,
|
||||
CEPH_MDS_OP_MKSNAP = 0x01400,
|
||||
CEPH_MDS_OP_RMSNAP = 0x01401,
|
||||
CEPH_MDS_OP_LSSNAP = 0x00402,
|
||||
};
|
||||
|
||||
extern const char *ceph_mds_op_name(int op);
|
||||
|
||||
|
||||
#define CEPH_SETATTR_MODE 1
|
||||
#define CEPH_SETATTR_UID 2
|
||||
#define CEPH_SETATTR_GID 4
|
||||
#define CEPH_SETATTR_MTIME 8
|
||||
#define CEPH_SETATTR_ATIME 16
|
||||
#define CEPH_SETATTR_SIZE 32
|
||||
#define CEPH_SETATTR_CTIME 64
|
||||
|
||||
union ceph_mds_request_args {
|
||||
struct {
|
||||
__le32 mask; /* CEPH_CAP_* */
|
||||
} __attribute__ ((packed)) getattr;
|
||||
struct {
|
||||
__le32 mode;
|
||||
__le32 uid;
|
||||
__le32 gid;
|
||||
struct ceph_timespec mtime;
|
||||
struct ceph_timespec atime;
|
||||
__le64 size, old_size; /* old_size needed by truncate */
|
||||
__le32 mask; /* CEPH_SETATTR_* */
|
||||
} __attribute__ ((packed)) setattr;
|
||||
struct {
|
||||
__le32 frag; /* which dir fragment */
|
||||
__le32 max_entries; /* how many dentries to grab */
|
||||
} __attribute__ ((packed)) readdir;
|
||||
struct {
|
||||
__le32 mode;
|
||||
__le32 rdev;
|
||||
} __attribute__ ((packed)) mknod;
|
||||
struct {
|
||||
__le32 mode;
|
||||
} __attribute__ ((packed)) mkdir;
|
||||
struct {
|
||||
__le32 flags;
|
||||
__le32 mode;
|
||||
__le32 stripe_unit; /* layout for newly created file */
|
||||
__le32 stripe_count; /* ... */
|
||||
__le32 object_size;
|
||||
__le32 file_replication;
|
||||
__le32 preferred;
|
||||
} __attribute__ ((packed)) open;
|
||||
struct {
|
||||
__le32 flags;
|
||||
} __attribute__ ((packed)) setxattr;
|
||||
struct {
|
||||
struct ceph_file_layout layout;
|
||||
} __attribute__ ((packed)) setlayout;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
#define CEPH_MDS_FLAG_REPLAY 1 /* this is a replayed op */
|
||||
#define CEPH_MDS_FLAG_WANT_DENTRY 2 /* want dentry in reply */
|
||||
|
||||
struct ceph_mds_request_head {
|
||||
__le64 oldest_client_tid;
|
||||
__le32 mdsmap_epoch; /* on client */
|
||||
__le32 flags; /* CEPH_MDS_FLAG_* */
|
||||
__u8 num_retry, num_fwd; /* count retry, fwd attempts */
|
||||
__le16 num_releases; /* # of cap/lease release records included */
|
||||
__le32 op; /* mds op code */
|
||||
__le32 caller_uid, caller_gid;
|
||||
__le64 ino; /* use this ino for openc, mkdir, mknod,
|
||||
etc. (if replaying) */
|
||||
union ceph_mds_request_args args;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
/* cap/lease release record */
|
||||
struct ceph_mds_request_release {
|
||||
__le64 ino, cap_id; /* ino and unique cap id */
|
||||
__le32 caps, wanted; /* new issued, wanted */
|
||||
__le32 seq, issue_seq, mseq;
|
||||
__le32 dname_seq; /* if releasing a dentry lease, a */
|
||||
__le32 dname_len; /* string follows. */
|
||||
} __attribute__ ((packed));
|
||||
|
||||
/* client reply */
|
||||
struct ceph_mds_reply_head {
|
||||
__le32 op;
|
||||
__le32 result;
|
||||
__le32 mdsmap_epoch;
|
||||
__u8 safe; /* true if committed to disk */
|
||||
__u8 is_dentry, is_target; /* true if dentry, target inode records
|
||||
are included with reply */
|
||||
} __attribute__ ((packed));
|
||||
|
||||
/* one for each node split */
|
||||
struct ceph_frag_tree_split {
|
||||
__le32 frag; /* this frag splits... */
|
||||
__le32 by; /* ...by this many bits */
|
||||
} __attribute__ ((packed));
|
||||
|
||||
struct ceph_frag_tree_head {
|
||||
__le32 nsplits; /* num ceph_frag_tree_split records */
|
||||
struct ceph_frag_tree_split splits[];
|
||||
} __attribute__ ((packed));
|
||||
|
||||
/* capability issue, for bundling with mds reply */
|
||||
struct ceph_mds_reply_cap {
|
||||
__le32 caps, wanted; /* caps issued, wanted */
|
||||
__le64 cap_id;
|
||||
__le32 seq, mseq;
|
||||
__le64 realm; /* snap realm */
|
||||
__u8 flags; /* CEPH_CAP_FLAG_* */
|
||||
} __attribute__ ((packed));
|
||||
|
||||
#define CEPH_CAP_FLAG_AUTH 1 /* cap is issued by auth mds */
|
||||
|
||||
/* inode record, for bundling with mds reply */
|
||||
struct ceph_mds_reply_inode {
|
||||
__le64 ino;
|
||||
__le64 snapid;
|
||||
__le32 rdev;
|
||||
__le64 version; /* inode version */
|
||||
__le64 xattr_version; /* version for xattr blob */
|
||||
struct ceph_mds_reply_cap cap; /* caps issued for this inode */
|
||||
struct ceph_file_layout layout;
|
||||
struct ceph_timespec ctime, mtime, atime;
|
||||
__le32 time_warp_seq;
|
||||
__le64 size, max_size, truncate_size;
|
||||
__le32 truncate_seq;
|
||||
__le32 mode, uid, gid;
|
||||
__le32 nlink;
|
||||
__le64 files, subdirs, rbytes, rfiles, rsubdirs; /* dir stats */
|
||||
struct ceph_timespec rctime;
|
||||
struct ceph_frag_tree_head fragtree; /* (must be at end of struct) */
|
||||
} __attribute__ ((packed));
|
||||
/* followed by frag array, then symlink string, then xattr blob */
|
||||
|
||||
/* reply_lease follows dname, and reply_inode */
|
||||
struct ceph_mds_reply_lease {
|
||||
__le16 mask; /* lease type(s) */
|
||||
__le32 duration_ms; /* lease duration */
|
||||
__le32 seq;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
struct ceph_mds_reply_dirfrag {
|
||||
__le32 frag; /* fragment */
|
||||
__le32 auth; /* auth mds, if this is a delegation point */
|
||||
__le32 ndist; /* number of MDSs this is replicated on */
|
||||
__le32 dist[];
|
||||
} __attribute__ ((packed));
|
||||
|
||||
/* file access modes */
|
||||
#define CEPH_FILE_MODE_PIN 0
|
||||
#define CEPH_FILE_MODE_RD 1
|
||||
#define CEPH_FILE_MODE_WR 2
|
||||
#define CEPH_FILE_MODE_RDWR 3 /* RD | WR */
|
||||
#define CEPH_FILE_MODE_LAZY 4 /* lazy io */
|
||||
#define CEPH_FILE_MODE_NUM 8 /* because these are bit fields, mostly */
|
||||
|
||||
int ceph_flags_to_mode(int flags);
|
||||
|
||||
|
||||
/* capability bits */
|
||||
#define CEPH_CAP_PIN 1 /* no specific capabilities beyond the pin */
|
||||
|
||||
/* generic cap bits */
|
||||
#define CEPH_CAP_GSHARED 1 /* client can read */
|
||||
#define CEPH_CAP_GEXCL 2 /* client can read and update */
|
||||
#define CEPH_CAP_GCACHE 4 /* (file) client can cache reads */
|
||||
#define CEPH_CAP_GRD 8 /* (file) client can read */
|
||||
#define CEPH_CAP_GWR 16 /* (file) client can write */
|
||||
#define CEPH_CAP_GBUFFER 32 /* (file) client can buffer writes */
|
||||
#define CEPH_CAP_GWREXTEND 64 /* (file) client can extend EOF */
|
||||
#define CEPH_CAP_GLAZYIO 128 /* (file) client can perform lazy io */
|
||||
|
||||
/* per-lock shift */
|
||||
#define CEPH_CAP_SAUTH 2
|
||||
#define CEPH_CAP_SLINK 4
|
||||
#define CEPH_CAP_SXATTR 6
|
||||
#define CEPH_CAP_SFILE 8 /* goes at the end (uses >2 cap bits) */
|
||||
|
||||
#define CEPH_CAP_BITS 16
|
||||
|
||||
/* composed values */
|
||||
#define CEPH_CAP_AUTH_SHARED (CEPH_CAP_GSHARED << CEPH_CAP_SAUTH)
|
||||
#define CEPH_CAP_AUTH_EXCL (CEPH_CAP_GEXCL << CEPH_CAP_SAUTH)
|
||||
#define CEPH_CAP_LINK_SHARED (CEPH_CAP_GSHARED << CEPH_CAP_SLINK)
|
||||
#define CEPH_CAP_LINK_EXCL (CEPH_CAP_GEXCL << CEPH_CAP_SLINK)
|
||||
#define CEPH_CAP_XATTR_SHARED (CEPH_CAP_GSHARED << CEPH_CAP_SXATTR)
|
||||
#define CEPH_CAP_XATTR_EXCL (CEPH_CAP_GEXCL << CEPH_CAP_SXATTR)
|
||||
#define CEPH_CAP_FILE(x) ((x) << CEPH_CAP_SFILE)
|
||||
#define CEPH_CAP_FILE_SHARED (CEPH_CAP_GSHARED << CEPH_CAP_SFILE)
|
||||
#define CEPH_CAP_FILE_EXCL (CEPH_CAP_GEXCL << CEPH_CAP_SFILE)
|
||||
#define CEPH_CAP_FILE_CACHE (CEPH_CAP_GCACHE << CEPH_CAP_SFILE)
|
||||
#define CEPH_CAP_FILE_RD (CEPH_CAP_GRD << CEPH_CAP_SFILE)
|
||||
#define CEPH_CAP_FILE_WR (CEPH_CAP_GWR << CEPH_CAP_SFILE)
|
||||
#define CEPH_CAP_FILE_BUFFER (CEPH_CAP_GBUFFER << CEPH_CAP_SFILE)
|
||||
#define CEPH_CAP_FILE_WREXTEND (CEPH_CAP_GWREXTEND << CEPH_CAP_SFILE)
|
||||
#define CEPH_CAP_FILE_LAZYIO (CEPH_CAP_GLAZYIO << CEPH_CAP_SFILE)
|
||||
|
||||
/* cap masks (for getattr) */
|
||||
#define CEPH_STAT_CAP_INODE CEPH_CAP_PIN
|
||||
#define CEPH_STAT_CAP_TYPE CEPH_CAP_PIN /* mode >> 12 */
|
||||
#define CEPH_STAT_CAP_SYMLINK CEPH_CAP_PIN
|
||||
#define CEPH_STAT_CAP_UID CEPH_CAP_AUTH_SHARED
|
||||
#define CEPH_STAT_CAP_GID CEPH_CAP_AUTH_SHARED
|
||||
#define CEPH_STAT_CAP_MODE CEPH_CAP_AUTH_SHARED
|
||||
#define CEPH_STAT_CAP_NLINK CEPH_CAP_LINK_SHARED
|
||||
#define CEPH_STAT_CAP_LAYOUT CEPH_CAP_FILE_SHARED
|
||||
#define CEPH_STAT_CAP_MTIME CEPH_CAP_FILE_SHARED
|
||||
#define CEPH_STAT_CAP_SIZE CEPH_CAP_FILE_SHARED
|
||||
#define CEPH_STAT_CAP_ATIME CEPH_CAP_FILE_SHARED /* fixme */
|
||||
#define CEPH_STAT_CAP_XATTR CEPH_CAP_XATTR_SHARED
|
||||
#define CEPH_STAT_CAP_INODE_ALL (CEPH_CAP_PIN | \
|
||||
CEPH_CAP_AUTH_SHARED | \
|
||||
CEPH_CAP_LINK_SHARED | \
|
||||
CEPH_CAP_FILE_SHARED | \
|
||||
CEPH_CAP_XATTR_SHARED)
|
||||
|
||||
#define CEPH_CAP_ANY_SHARED (CEPH_CAP_AUTH_SHARED | \
|
||||
CEPH_CAP_LINK_SHARED | \
|
||||
CEPH_CAP_XATTR_SHARED | \
|
||||
CEPH_CAP_FILE_SHARED)
|
||||
#define CEPH_CAP_ANY_RD (CEPH_CAP_ANY_SHARED | CEPH_CAP_FILE_RD | \
|
||||
CEPH_CAP_FILE_CACHE)
|
||||
|
||||
#define CEPH_CAP_ANY_EXCL (CEPH_CAP_AUTH_EXCL | \
|
||||
CEPH_CAP_LINK_EXCL | \
|
||||
CEPH_CAP_XATTR_EXCL | \
|
||||
CEPH_CAP_FILE_EXCL)
|
||||
#define CEPH_CAP_ANY_FILE_WR (CEPH_CAP_FILE_WR | CEPH_CAP_FILE_BUFFER | \
|
||||
CEPH_CAP_FILE_EXCL)
|
||||
#define CEPH_CAP_ANY_WR (CEPH_CAP_ANY_EXCL | CEPH_CAP_ANY_FILE_WR)
|
||||
#define CEPH_CAP_ANY (CEPH_CAP_ANY_RD | CEPH_CAP_ANY_EXCL | \
|
||||
CEPH_CAP_ANY_FILE_WR | CEPH_CAP_PIN)
|
||||
|
||||
#define CEPH_CAP_LOCKS (CEPH_LOCK_IFILE | CEPH_LOCK_IAUTH | CEPH_LOCK_ILINK | \
|
||||
CEPH_LOCK_IXATTR)
|
||||
|
||||
int ceph_caps_for_mode(int mode);
|
||||
|
||||
enum {
|
||||
CEPH_CAP_OP_GRANT, /* mds->client grant */
|
||||
CEPH_CAP_OP_REVOKE, /* mds->client revoke */
|
||||
CEPH_CAP_OP_TRUNC, /* mds->client trunc notify */
|
||||
CEPH_CAP_OP_EXPORT, /* mds has exported the cap */
|
||||
CEPH_CAP_OP_IMPORT, /* mds has imported the cap */
|
||||
CEPH_CAP_OP_UPDATE, /* client->mds update */
|
||||
CEPH_CAP_OP_DROP, /* client->mds drop cap bits */
|
||||
CEPH_CAP_OP_FLUSH, /* client->mds cap writeback */
|
||||
CEPH_CAP_OP_FLUSH_ACK, /* mds->client flushed */
|
||||
CEPH_CAP_OP_FLUSHSNAP, /* client->mds flush snapped metadata */
|
||||
CEPH_CAP_OP_FLUSHSNAP_ACK, /* mds->client flushed snapped metadata */
|
||||
CEPH_CAP_OP_RELEASE, /* client->mds release (clean) cap */
|
||||
CEPH_CAP_OP_RENEW, /* client->mds renewal request */
|
||||
};
|
||||
|
||||
extern const char *ceph_cap_op_name(int op);
|
||||
|
||||
/*
|
||||
* caps message, used for capability callbacks, acks, requests, etc.
|
||||
*/
|
||||
struct ceph_mds_caps {
|
||||
__le32 op; /* CEPH_CAP_OP_* */
|
||||
__le64 ino, realm;
|
||||
__le64 cap_id;
|
||||
__le32 seq, issue_seq;
|
||||
__le32 caps, wanted, dirty; /* latest issued/wanted/dirty */
|
||||
__le32 migrate_seq;
|
||||
__le64 snap_follows;
|
||||
__le32 snap_trace_len;
|
||||
|
||||
/* authlock */
|
||||
__le32 uid, gid, mode;
|
||||
|
||||
/* linklock */
|
||||
__le32 nlink;
|
||||
|
||||
/* xattrlock */
|
||||
__le32 xattr_len;
|
||||
__le64 xattr_version;
|
||||
|
||||
/* filelock */
|
||||
__le64 size, max_size, truncate_size;
|
||||
__le32 truncate_seq;
|
||||
struct ceph_timespec mtime, atime, ctime;
|
||||
struct ceph_file_layout layout;
|
||||
__le32 time_warp_seq;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
/* cap release msg head */
|
||||
struct ceph_mds_cap_release {
|
||||
__le32 num; /* number of cap_items that follow */
|
||||
} __attribute__ ((packed));
|
||||
|
||||
struct ceph_mds_cap_item {
|
||||
__le64 ino;
|
||||
__le64 cap_id;
|
||||
__le32 migrate_seq, seq;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
#define CEPH_MDS_LEASE_REVOKE 1 /* mds -> client */
|
||||
#define CEPH_MDS_LEASE_RELEASE 2 /* client -> mds */
|
||||
#define CEPH_MDS_LEASE_RENEW 3 /* client <-> mds */
|
||||
#define CEPH_MDS_LEASE_REVOKE_ACK 4 /* client -> mds */
|
||||
|
||||
extern const char *ceph_lease_op_name(int o);
|
||||
|
||||
/* lease msg header */
|
||||
struct ceph_mds_lease {
|
||||
__u8 action; /* CEPH_MDS_LEASE_* */
|
||||
__le16 mask; /* which lease */
|
||||
__le64 ino;
|
||||
__le64 first, last; /* snap range */
|
||||
__le32 seq;
|
||||
__le32 duration_ms; /* duration of renewal */
|
||||
} __attribute__ ((packed));
|
||||
/* followed by a __le32+string for dname */
|
||||
|
||||
/* client reconnect */
|
||||
struct ceph_mds_cap_reconnect {
|
||||
__le64 cap_id;
|
||||
__le32 wanted;
|
||||
__le32 issued;
|
||||
__le64 size;
|
||||
struct ceph_timespec mtime, atime;
|
||||
__le64 snaprealm;
|
||||
__le64 pathbase; /* base ino for our path to this ino */
|
||||
} __attribute__ ((packed));
|
||||
/* followed by encoded string */
|
||||
|
||||
struct ceph_mds_snaprealm_reconnect {
|
||||
__le64 ino; /* snap realm base */
|
||||
__le64 seq; /* snap seq for this snap realm */
|
||||
__le64 parent; /* parent realm */
|
||||
} __attribute__ ((packed));
|
||||
|
||||
/*
|
||||
* snaps
|
||||
*/
|
||||
enum {
|
||||
CEPH_SNAP_OP_UPDATE, /* CREATE or DESTROY */
|
||||
CEPH_SNAP_OP_CREATE,
|
||||
CEPH_SNAP_OP_DESTROY,
|
||||
CEPH_SNAP_OP_SPLIT,
|
||||
};
|
||||
|
||||
extern const char *ceph_snap_op_name(int o);
|
||||
|
||||
/* snap msg header */
|
||||
struct ceph_mds_snap_head {
|
||||
__le32 op; /* CEPH_SNAP_OP_* */
|
||||
__le64 split; /* ino to split off, if any */
|
||||
__le32 num_split_inos; /* # inos belonging to new child realm */
|
||||
__le32 num_split_realms; /* # child realms under new child realm */
|
||||
__le32 trace_len; /* size of snap trace blob */
|
||||
} __attribute__ ((packed));
|
||||
/* followed by split ino list, then split realms, then the trace blob */
|
||||
|
||||
/*
|
||||
* encode info about a snaprealm, as viewed by a client
|
||||
*/
|
||||
struct ceph_mds_snap_realm {
|
||||
__le64 ino; /* ino */
|
||||
__le64 created; /* snap: when created */
|
||||
__le64 parent; /* ino: parent realm */
|
||||
__le64 parent_since; /* snap: same parent since */
|
||||
__le64 seq; /* snap: version */
|
||||
__le32 num_snaps;
|
||||
__le32 num_prior_parent_snaps;
|
||||
} __attribute__ ((packed));
|
||||
/* followed by my snap list, then prior parent snap list */
|
||||
|
||||
#endif
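The capability constants in this header are built by shifting a small set of generic bits by a per-lock offset. The sketch below is only an illustration of that composition: the macro values are copied from the definitions above, while `demo_file_generic_bits`, `demo_read_mode_caps` and `demo_caps_selftest` are hypothetical helpers that do not exist in the Ceph sources.

```c
#include <assert.h>

/* generic cap bits and per-lock shifts, copied from the header above */
#define CEPH_CAP_PIN       1
#define CEPH_CAP_GSHARED   1
#define CEPH_CAP_GEXCL     2
#define CEPH_CAP_GCACHE    4
#define CEPH_CAP_GRD       8
#define CEPH_CAP_SAUTH     2
#define CEPH_CAP_SXATTR    6
#define CEPH_CAP_SFILE     8

#define CEPH_CAP_AUTH_SHARED  (CEPH_CAP_GSHARED << CEPH_CAP_SAUTH)
#define CEPH_CAP_XATTR_EXCL   (CEPH_CAP_GEXCL   << CEPH_CAP_SXATTR)
#define CEPH_CAP_FILE_SHARED  (CEPH_CAP_GSHARED << CEPH_CAP_SFILE)
#define CEPH_CAP_FILE_CACHE   (CEPH_CAP_GCACHE  << CEPH_CAP_SFILE)
#define CEPH_CAP_FILE_RD      (CEPH_CAP_GRD     << CEPH_CAP_SFILE)

/*
 * Hypothetical helper: recover the generic bits held on the file lock.
 * The file lock uses eight generic bits (GSHARED through GLAZYIO),
 * hence the 0xff mask.
 */
static int demo_file_generic_bits(int caps)
{
	return (caps >> CEPH_CAP_SFILE) & 0xff;
}

/* The caps that ceph_caps_for_mode() returns for CEPH_FILE_MODE_RD. */
static int demo_read_mode_caps(void)
{
	return CEPH_CAP_PIN | CEPH_CAP_FILE_SHARED |
		CEPH_CAP_FILE_RD | CEPH_CAP_FILE_CACHE;
}

static void demo_caps_selftest(void)
{
	assert(CEPH_CAP_AUTH_SHARED == 4);	/* 1 << 2 */
	assert(CEPH_CAP_XATTR_EXCL == 128);	/* 2 << 6 */
	assert(demo_file_generic_bits(demo_read_mode_caps()) ==
	       (CEPH_CAP_GSHARED | CEPH_CAP_GCACHE | CEPH_CAP_GRD));
}
```

Because the auth, link and xattr locks each get two bits and the file lock starts at bit 8, the composed masks never overlap, so they can be combined with a plain bitwise OR.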
|
118
fs/ceph/ceph_hash.c
Normal file
@ -0,0 +1,118 @@
#include "types.h"

/*
 * Robert Jenkins' hash function.
 * http://burtleburtle.net/bob/hash/evahash.html
 * This is in the public domain.
 */
#define mix(a, b, c)						\
	do {							\
		a = a - b;  a = a - c;  a = a ^ (c >> 13);	\
		b = b - c;  b = b - a;  b = b ^ (a << 8);	\
		c = c - a;  c = c - b;  c = c ^ (b >> 13);	\
		a = a - b;  a = a - c;  a = a ^ (c >> 12);	\
		b = b - c;  b = b - a;  b = b ^ (a << 16);	\
		c = c - a;  c = c - b;  c = c ^ (b >> 5);	\
		a = a - b;  a = a - c;  a = a ^ (c >> 3);	\
		b = b - c;  b = b - a;  b = b ^ (a << 10);	\
		c = c - a;  c = c - b;  c = c ^ (b >> 15);	\
	} while (0)

unsigned ceph_str_hash_rjenkins(const char *str, unsigned length)
{
	const unsigned char *k = (const unsigned char *)str;
	__u32 a, b, c;  /* the internal state */
	__u32 len;      /* how many key bytes still need mixing */

	/* Set up the internal state */
	len = length;
	a = 0x9e3779b9;      /* the golden ratio; an arbitrary value */
	b = a;
	c = 0;               /* variable initialization of internal state */

	/* handle most of the key */
	while (len >= 12) {
		a = a + (k[0] + ((__u32)k[1] << 8) + ((__u32)k[2] << 16) +
			 ((__u32)k[3] << 24));
		b = b + (k[4] + ((__u32)k[5] << 8) + ((__u32)k[6] << 16) +
			 ((__u32)k[7] << 24));
		c = c + (k[8] + ((__u32)k[9] << 8) + ((__u32)k[10] << 16) +
			 ((__u32)k[11] << 24));
		mix(a, b, c);
		k = k + 12;
		len = len - 12;
	}

	/* handle the last 11 bytes */
	c = c + length;
	switch (len) {            /* all the case statements fall through */
	case 11:
		c = c + ((__u32)k[10] << 24);
	case 10:
		c = c + ((__u32)k[9] << 16);
	case 9:
		c = c + ((__u32)k[8] << 8);
		/* the first byte of c is reserved for the length */
	case 8:
		b = b + ((__u32)k[7] << 24);
	case 7:
		b = b + ((__u32)k[6] << 16);
	case 6:
		b = b + ((__u32)k[5] << 8);
	case 5:
		b = b + k[4];
	case 4:
		a = a + ((__u32)k[3] << 24);
	case 3:
		a = a + ((__u32)k[2] << 16);
	case 2:
		a = a + ((__u32)k[1] << 8);
	case 1:
		a = a + k[0];
		/* case 0: nothing left to add */
	}
	mix(a, b, c);

	return c;
}

/*
 * linux dcache hash
 */
unsigned ceph_str_hash_linux(const char *str, unsigned length)
{
	unsigned long hash = 0;
	unsigned char c;

	while (length--) {
		c = *str++;
		hash = (hash + (c << 4) + (c >> 4)) * 11;
	}
	return hash;
}


unsigned ceph_str_hash(int type, const char *s, unsigned len)
{
	switch (type) {
	case CEPH_STR_HASH_LINUX:
		return ceph_str_hash_linux(s, len);
	case CEPH_STR_HASH_RJENKINS:
		return ceph_str_hash_rjenkins(s, len);
	default:
		return -1;
	}
}

const char *ceph_str_hash_name(int type)
{
	switch (type) {
	case CEPH_STR_HASH_LINUX:
		return "linux";
	case CEPH_STR_HASH_RJENKINS:
		return "rjenkins";
	default:
		return "unknown";
	}
}
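To see what the simpler of the two hashes produces, here is a stand-alone user-space sketch of the same loop as ceph_str_hash_linux. `demo_str_hash_linux` and `demo_hash_selftest` are hypothetical names. The accumulator is a fixed 32-bit type rather than `unsigned long`, which is an assumption made so the result does not depend on the platform; since the loop uses only wrapping addition and multiplication, the low 32 bits should match the original wherever `unsigned` is 32 bits wide.

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as the dcache-style hash above, in 32-bit wrapping math. */
static uint32_t demo_str_hash_linux(const char *str, unsigned length)
{
	uint32_t hash = 0;
	unsigned char c;

	while (length--) {
		c = (unsigned char)*str++;
		hash = (hash + (uint32_t)(c << 4) + (uint32_t)(c >> 4)) * 11;
	}
	return hash;
}

/* Hand-computed values for very short inputs. */
static void demo_hash_selftest(void)
{
	/* empty input leaves the accumulator at zero */
	assert(demo_str_hash_linux("", 0) == 0);
	/* 'a' = 97: (0 + 1552 + 6) * 11 */
	assert(demo_str_hash_linux("a", 1) == 17138);
	/* 'b' = 98: (17138 + 1568 + 6) * 11 */
	assert(demo_str_hash_linux("ab", 2) == 205832);
}
```

The length is passed explicitly, so the input need not be NUL-terminated, which matches how the functions above take a `(str, length)` pair.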
|
13
fs/ceph/ceph_hash.h
Normal file
@ -0,0 +1,13 @@
#ifndef _FS_CEPH_HASH_H
#define _FS_CEPH_HASH_H

#define CEPH_STR_HASH_LINUX      0x1  /* linux dcache hash */
#define CEPH_STR_HASH_RJENKINS   0x2  /* robert jenkins' */

extern unsigned ceph_str_hash_linux(const char *s, unsigned len);
extern unsigned ceph_str_hash_rjenkins(const char *s, unsigned len);

extern unsigned ceph_str_hash(int type, const char *s, unsigned len);
extern const char *ceph_str_hash_name(int type);

#endif
|
176
fs/ceph/ceph_strings.c
Normal file
@ -0,0 +1,176 @@
/*
 * Ceph string constants
 */
#include "types.h"

const char *ceph_entity_type_name(int type)
{
	switch (type) {
	case CEPH_ENTITY_TYPE_MDS: return "mds";
	case CEPH_ENTITY_TYPE_OSD: return "osd";
	case CEPH_ENTITY_TYPE_MON: return "mon";
	case CEPH_ENTITY_TYPE_CLIENT: return "client";
	case CEPH_ENTITY_TYPE_ADMIN: return "admin";
	case CEPH_ENTITY_TYPE_AUTH: return "auth";
	default: return "unknown";
	}
}

const char *ceph_osd_op_name(int op)
{
	switch (op) {
	case CEPH_OSD_OP_READ: return "read";
	case CEPH_OSD_OP_STAT: return "stat";

	case CEPH_OSD_OP_MASKTRUNC: return "masktrunc";

	case CEPH_OSD_OP_WRITE: return "write";
	case CEPH_OSD_OP_DELETE: return "delete";
	case CEPH_OSD_OP_TRUNCATE: return "truncate";
	case CEPH_OSD_OP_ZERO: return "zero";
	case CEPH_OSD_OP_WRITEFULL: return "writefull";

	case CEPH_OSD_OP_APPEND: return "append";
	case CEPH_OSD_OP_STARTSYNC: return "startsync";
	case CEPH_OSD_OP_SETTRUNC: return "settrunc";
	case CEPH_OSD_OP_TRIMTRUNC: return "trimtrunc";

	case CEPH_OSD_OP_TMAPUP: return "tmapup";
	case CEPH_OSD_OP_TMAPGET: return "tmapget";
	case CEPH_OSD_OP_TMAPPUT: return "tmapput";

	case CEPH_OSD_OP_GETXATTR: return "getxattr";
	case CEPH_OSD_OP_GETXATTRS: return "getxattrs";
	case CEPH_OSD_OP_SETXATTR: return "setxattr";
	case CEPH_OSD_OP_SETXATTRS: return "setxattrs";
	case CEPH_OSD_OP_RESETXATTRS: return "resetxattrs";
	case CEPH_OSD_OP_RMXATTR: return "rmxattr";

	case CEPH_OSD_OP_PULL: return "pull";
	case CEPH_OSD_OP_PUSH: return "push";
	case CEPH_OSD_OP_BALANCEREADS: return "balance-reads";
	case CEPH_OSD_OP_UNBALANCEREADS: return "unbalance-reads";
	case CEPH_OSD_OP_SCRUB: return "scrub";

	case CEPH_OSD_OP_WRLOCK: return "wrlock";
	case CEPH_OSD_OP_WRUNLOCK: return "wrunlock";
	case CEPH_OSD_OP_RDLOCK: return "rdlock";
	case CEPH_OSD_OP_RDUNLOCK: return "rdunlock";
	case CEPH_OSD_OP_UPLOCK: return "uplock";
	case CEPH_OSD_OP_DNLOCK: return "dnlock";

	case CEPH_OSD_OP_CALL: return "call";

	case CEPH_OSD_OP_PGLS: return "pgls";
	}
	return "???";
}

const char *ceph_mds_state_name(int s)
{
	switch (s) {
		/* down and out */
	case CEPH_MDS_STATE_DNE: return "down:dne";
	case CEPH_MDS_STATE_STOPPED: return "down:stopped";
		/* up and out */
	case CEPH_MDS_STATE_BOOT: return "up:boot";
	case CEPH_MDS_STATE_STANDBY: return "up:standby";
	case CEPH_MDS_STATE_STANDBY_REPLAY: return "up:standby-replay";
	case CEPH_MDS_STATE_CREATING: return "up:creating";
	case CEPH_MDS_STATE_STARTING: return "up:starting";
		/* up and in */
	case CEPH_MDS_STATE_REPLAY: return "up:replay";
	case CEPH_MDS_STATE_RESOLVE: return "up:resolve";
	case CEPH_MDS_STATE_RECONNECT: return "up:reconnect";
	case CEPH_MDS_STATE_REJOIN: return "up:rejoin";
	case CEPH_MDS_STATE_CLIENTREPLAY: return "up:clientreplay";
	case CEPH_MDS_STATE_ACTIVE: return "up:active";
	case CEPH_MDS_STATE_STOPPING: return "up:stopping";
	}
	return "???";
}

const char *ceph_session_op_name(int op)
{
	switch (op) {
	case CEPH_SESSION_REQUEST_OPEN: return "request_open";
	case CEPH_SESSION_OPEN: return "open";
	case CEPH_SESSION_REQUEST_CLOSE: return "request_close";
	case CEPH_SESSION_CLOSE: return "close";
	case CEPH_SESSION_REQUEST_RENEWCAPS: return "request_renewcaps";
	case CEPH_SESSION_RENEWCAPS: return "renewcaps";
	case CEPH_SESSION_STALE: return "stale";
	case CEPH_SESSION_RECALL_STATE: return "recall_state";
	}
	return "???";
}

const char *ceph_mds_op_name(int op)
{
	switch (op) {
	case CEPH_MDS_OP_LOOKUP: return "lookup";
	case CEPH_MDS_OP_LOOKUPHASH: return "lookuphash";
	case CEPH_MDS_OP_LOOKUPPARENT: return "lookupparent";
	case CEPH_MDS_OP_GETATTR: return "getattr";
	case CEPH_MDS_OP_SETXATTR: return "setxattr";
	case CEPH_MDS_OP_SETATTR: return "setattr";
	case CEPH_MDS_OP_RMXATTR: return "rmxattr";
	case CEPH_MDS_OP_READDIR: return "readdir";
	case CEPH_MDS_OP_MKNOD: return "mknod";
	case CEPH_MDS_OP_LINK: return "link";
	case CEPH_MDS_OP_UNLINK: return "unlink";
	case CEPH_MDS_OP_RENAME: return "rename";
	case CEPH_MDS_OP_MKDIR: return "mkdir";
	case CEPH_MDS_OP_RMDIR: return "rmdir";
	case CEPH_MDS_OP_SYMLINK: return "symlink";
	case CEPH_MDS_OP_CREATE: return "create";
	case CEPH_MDS_OP_OPEN: return "open";
	case CEPH_MDS_OP_LOOKUPSNAP: return "lookupsnap";
	case CEPH_MDS_OP_LSSNAP: return "lssnap";
	case CEPH_MDS_OP_MKSNAP: return "mksnap";
	case CEPH_MDS_OP_RMSNAP: return "rmsnap";
	}
	return "???";
}

const char *ceph_cap_op_name(int op)
{
	switch (op) {
	case CEPH_CAP_OP_GRANT: return "grant";
	case CEPH_CAP_OP_REVOKE: return "revoke";
	case CEPH_CAP_OP_TRUNC: return "trunc";
	case CEPH_CAP_OP_EXPORT: return "export";
	case CEPH_CAP_OP_IMPORT: return "import";
	case CEPH_CAP_OP_UPDATE: return "update";
	case CEPH_CAP_OP_DROP: return "drop";
	case CEPH_CAP_OP_FLUSH: return "flush";
	case CEPH_CAP_OP_FLUSH_ACK: return "flush_ack";
	case CEPH_CAP_OP_FLUSHSNAP: return "flushsnap";
	case CEPH_CAP_OP_FLUSHSNAP_ACK: return "flushsnap_ack";
	case CEPH_CAP_OP_RELEASE: return "release";
	case CEPH_CAP_OP_RENEW: return "renew";
	}
	return "???";
}

const char *ceph_lease_op_name(int o)
{
	switch (o) {
	case CEPH_MDS_LEASE_REVOKE: return "revoke";
	case CEPH_MDS_LEASE_RELEASE: return "release";
	case CEPH_MDS_LEASE_RENEW: return "renew";
	case CEPH_MDS_LEASE_REVOKE_ACK: return "revoke_ack";
	}
	return "???";
}

const char *ceph_snap_op_name(int o)
{
	switch (o) {
	case CEPH_SNAP_OP_UPDATE: return "update";
	case CEPH_SNAP_OP_CREATE: return "create";
	case CEPH_SNAP_OP_DESTROY: return "destroy";
	case CEPH_SNAP_OP_SPLIT: return "split";
	}
	return "???";
}
|
151
fs/ceph/crush/crush.c
Normal file
@ -0,0 +1,151 @@
|
||||
|
||||
#ifdef __KERNEL__
|
||||
# include <linux/slab.h>
|
||||
#else
|
||||
# include <stdlib.h>
|
||||
# include <assert.h>
|
||||
# define kfree(x) do { if (x) free(x); } while (0)
|
||||
# define BUG_ON(x) assert(!(x))
|
||||
#endif
|
||||
|
||||
#include "crush.h"
|
||||
|
||||
const char *crush_bucket_alg_name(int alg)
|
||||
{
|
||||
switch (alg) {
|
||||
case CRUSH_BUCKET_UNIFORM: return "uniform";
|
||||
case CRUSH_BUCKET_LIST: return "list";
|
||||
case CRUSH_BUCKET_TREE: return "tree";
|
||||
case CRUSH_BUCKET_STRAW: return "straw";
|
||||
default: return "unknown";
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* crush_get_bucket_item_weight - Get weight of an item in given bucket
|
||||
* @b: bucket pointer
|
||||
* @p: item index in bucket
|
||||
*/
|
||||
int crush_get_bucket_item_weight(struct crush_bucket *b, int p)
|
||||
{
|
||||
if (p >= b->size)
|
||||
return 0;
|
||||
|
||||
switch (b->alg) {
|
||||
case CRUSH_BUCKET_UNIFORM:
|
||||
return ((struct crush_bucket_uniform *)b)->item_weight;
|
||||
case CRUSH_BUCKET_LIST:
|
||||
return ((struct crush_bucket_list *)b)->item_weights[p];
|
||||
case CRUSH_BUCKET_TREE:
|
||||
if (p & 1)
|
||||
return ((struct crush_bucket_tree *)b)->node_weights[p];
|
||||
return 0;
|
||||
case CRUSH_BUCKET_STRAW:
|
||||
return ((struct crush_bucket_straw *)b)->item_weights[p];
|
||||
}
|
||||
return 0;
|
||||
}
|
||||
|
||||
/**
|
||||
* crush_calc_parents - Calculate parent vectors for the given crush map.
|
||||
* @map: crush_map pointer
|
||||
*/
|
||||
void crush_calc_parents(struct crush_map *map)
|
||||
{
|
||||
int i, b, c;
|
||||
|
||||
for (b = 0; b < map->max_buckets; b++) {
|
||||
if (map->buckets[b] == NULL)
|
||||
continue;
|
||||
for (i = 0; i < map->buckets[b]->size; i++) {
|
||||
c = map->buckets[b]->items[i];
|
||||
BUG_ON(c >= map->max_devices ||
|
||||
c < -map->max_buckets);
|
||||
if (c >= 0)
|
||||
map->device_parents[c] = map->buckets[b]->id;
|
||||
else
|
||||
map->bucket_parents[-1-c] = map->buckets[b]->id;
|
||||
}
|
||||
}
|
||||
}
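The id convention used by `crush_calc_parents()` above (devices are `>= 0`, buckets are negative, and bucket id `-1-n` lives in slot `n`) can be sketched in isolation. This is a minimal illustration, not the kernel code: `struct demo_bucket` and every `demo_*` name are hypothetical stand-ins, and the parent arrays are plain `int` here for readability.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct crush_bucket: only the fields used here. */
struct demo_bucket {
	int id;            /* negative bucket id */
	int size;          /* number of items */
	const int *items;  /* item ids: >= 0 is a device, < 0 is a bucket */
};

/* Bucket id -1 lives in slot 0, id -2 in slot 1, and so on. */
static int demo_bucket_slot(int id)
{
	assert(id < 0);
	return -1 - id;
}

/* Same walk as crush_calc_parents(), over the simplified structs. */
static void demo_calc_parents(const struct demo_bucket **buckets,
			      int max_buckets, int max_devices,
			      int *bucket_parents, int *device_parents)
{
	for (int b = 0; b < max_buckets; b++) {
		if (buckets[b] == NULL)
			continue;
		for (int i = 0; i < buckets[b]->size; i++) {
			int c = buckets[b]->items[i];

			assert(c < max_devices && c >= -max_buckets);
			if (c >= 0)
				device_parents[c] = buckets[b]->id;
			else
				bucket_parents[demo_bucket_slot(c)] = buckets[b]->id;
		}
	}
}

/* Build root(-1) -> { host(-2) -> {0, 1}, host(-3) -> {2, 3} }, look up a parent. */
static int demo_parent_of(int item)
{
	static const int root_items[] = { -2, -3 };
	static const int host_a_items[] = { 0, 1 };
	static const int host_b_items[] = { 2, 3 };
	static const struct demo_bucket root = { -1, 2, root_items };
	static const struct demo_bucket host_a = { -2, 2, host_a_items };
	static const struct demo_bucket host_b = { -3, 2, host_b_items };
	const struct demo_bucket *buckets[] = { &root, &host_a, &host_b };
	int bucket_parents[3] = { 0, 0, 0 };
	int device_parents[4] = { 0, 0, 0, 0 };

	demo_calc_parents(buckets, 3, 4, bucket_parents, device_parents);
	return item >= 0 ? device_parents[item]
			 : bucket_parents[demo_bucket_slot(item)];
}
```

In this sketch a parent of 0 means "no parent", which is why the root bucket reports 0; `crush_do_rule()` relies on the same convention when it walks the force context.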
|
||||
|
||||
void crush_destroy_bucket_uniform(struct crush_bucket_uniform *b)
|
||||
{
|
||||
kfree(b->h.perm);
|
||||
kfree(b->h.items);
|
||||
kfree(b);
|
||||
}
|
||||
|
||||
void crush_destroy_bucket_list(struct crush_bucket_list *b)
|
||||
{
|
||||
kfree(b->item_weights);
|
||||
kfree(b->sum_weights);
|
||||
kfree(b->h.perm);
|
||||
kfree(b->h.items);
|
||||
kfree(b);
|
||||
}
|
||||
|
||||
void crush_destroy_bucket_tree(struct crush_bucket_tree *b)
|
||||
{
|
||||
kfree(b->h.perm);
kfree(b->h.items);
kfree(b->node_weights);
|
||||
kfree(b);
|
||||
}
|
||||
|
||||
void crush_destroy_bucket_straw(struct crush_bucket_straw *b)
|
||||
{
|
||||
kfree(b->straws);
|
||||
kfree(b->item_weights);
|
||||
kfree(b->h.perm);
|
||||
kfree(b->h.items);
|
||||
kfree(b);
|
||||
}
|
||||
|
||||
void crush_destroy_bucket(struct crush_bucket *b)
|
||||
{
|
||||
switch (b->alg) {
|
||||
case CRUSH_BUCKET_UNIFORM:
|
||||
crush_destroy_bucket_uniform((struct crush_bucket_uniform *)b);
|
||||
break;
|
||||
case CRUSH_BUCKET_LIST:
|
||||
crush_destroy_bucket_list((struct crush_bucket_list *)b);
|
||||
break;
|
||||
case CRUSH_BUCKET_TREE:
|
||||
crush_destroy_bucket_tree((struct crush_bucket_tree *)b);
|
||||
break;
|
||||
case CRUSH_BUCKET_STRAW:
|
||||
crush_destroy_bucket_straw((struct crush_bucket_straw *)b);
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* crush_destroy - Destroy a crush_map
|
||||
* @map: crush_map pointer
|
||||
*/
|
||||
void crush_destroy(struct crush_map *map)
|
||||
{
|
||||
int b;
|
||||
|
||||
/* buckets */
|
||||
if (map->buckets) {
|
||||
for (b = 0; b < map->max_buckets; b++) {
|
||||
if (map->buckets[b] == NULL)
|
||||
continue;
|
||||
crush_destroy_bucket(map->buckets[b]);
|
||||
}
|
||||
kfree(map->buckets);
|
||||
}
|
||||
|
||||
/* rules */
|
||||
if (map->rules) {
|
||||
for (b = 0; b < map->max_rules; b++)
|
||||
kfree(map->rules[b]);
|
||||
kfree(map->rules);
|
||||
}
|
||||
|
||||
kfree(map->bucket_parents);
|
||||
kfree(map->device_parents);
|
||||
kfree(map);
|
||||
}
|
||||
|
||||
|
180
fs/ceph/crush/crush.h
Normal file
@ -0,0 +1,180 @@
|
||||
#ifndef _CRUSH_CRUSH_H
|
||||
#define _CRUSH_CRUSH_H
|
||||
|
||||
#include <linux/types.h>
|
||||
|
||||
/*
|
||||
* CRUSH is a pseudo-random data distribution algorithm that
|
||||
* efficiently distributes input values (typically, data objects)
|
||||
* across a heterogeneous, structured storage cluster.
|
||||
*
|
||||
* The algorithm was originally described in detail in this paper
|
||||
* (although the algorithm has evolved somewhat since then):
|
||||
*
|
||||
* http://www.ssrc.ucsc.edu/Papers/weil-sc06.pdf
|
||||
*
|
||||
* LGPL2
|
||||
*/
|
||||
|
||||
|
||||
#define CRUSH_MAGIC 0x00010000ul /* for detecting algorithm revisions */
|
||||
|
||||
|
||||
#define CRUSH_MAX_DEPTH 10 /* max crush hierarchy depth */
|
||||
#define CRUSH_MAX_SET 10 /* max size of a mapping result */
|
||||
|
||||
|
||||
/*
|
||||
* CRUSH uses user-defined "rules" to describe how inputs should be
|
||||
* mapped to devices. A rule consists of a sequence of steps to perform
|
||||
* to generate the set of output devices.
|
||||
*/
|
||||
struct crush_rule_step {
|
||||
__u32 op;
|
||||
__s32 arg1;
|
||||
__s32 arg2;
|
||||
};
|
||||
|
||||
/* step op codes */
|
||||
enum {
|
||||
CRUSH_RULE_NOOP = 0,
|
||||
CRUSH_RULE_TAKE = 1, /* arg1 = value to start with */
|
||||
CRUSH_RULE_CHOOSE_FIRSTN = 2, /* arg1 = num items to pick */
|
||||
/* arg2 = type */
|
||||
CRUSH_RULE_CHOOSE_INDEP = 3, /* same */
|
||||
CRUSH_RULE_EMIT = 4, /* no args */
|
||||
CRUSH_RULE_CHOOSE_LEAF_FIRSTN = 6,
|
||||
CRUSH_RULE_CHOOSE_LEAF_INDEP = 7,
|
||||
};
|
||||
|
||||
/*
|
||||
* for specifying choose num (arg1) relative to the max parameter
|
||||
* passed to do_rule
|
||||
*/
|
||||
#define CRUSH_CHOOSE_N 0
|
||||
#define CRUSH_CHOOSE_N_MINUS(x) (-(x))
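How a relative choose count resolves against the `result_max` passed to the mapper can be shown with a small sketch. It mirrors the `numrep <= 0` handling in `crush_do_rule()`; the `DEMO_*` macros and `demo_resolve_numrep()` are hypothetical names used only for this illustration.

```c
#include <assert.h>

#define DEMO_CHOOSE_N 0
#define DEMO_CHOOSE_N_MINUS(x) (-(x))

/* Returns the absolute replica count, or 0 when the step selects nothing. */
static int demo_resolve_numrep(int arg1, int result_max)
{
	int numrep = arg1;

	if (numrep <= 0) {
		/* zero or negative: interpret relative to result_max */
		numrep += result_max;
		if (numrep <= 0)
			return 0;
	}
	return numrep;
}
```

With `result_max = 3`, `DEMO_CHOOSE_N` asks for 3 items and `DEMO_CHOOSE_N_MINUS(1)` asks for 2, while a positive `arg1` is taken literally.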
|
||||
|
||||
/*
|
||||
* The rule mask is used to describe what the rule is intended for.
|
||||
* Given a ruleset and size of output set, we search through the
|
||||
* rule list for a matching rule_mask.
|
||||
*/
|
||||
struct crush_rule_mask {
|
||||
__u8 ruleset;
|
||||
__u8 type;
|
||||
__u8 min_size;
|
||||
__u8 max_size;
|
||||
};
|
||||
|
||||
struct crush_rule {
|
||||
__u32 len;
|
||||
struct crush_rule_mask mask;
|
||||
struct crush_rule_step steps[0];
|
||||
};
|
||||
|
||||
#define crush_rule_size(len) (sizeof(struct crush_rule) + \
|
||||
(len)*sizeof(struct crush_rule_step))
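Because the steps are stored in a trailing array, a rule has to be allocated with room for its steps, which is what `crush_rule_size()` computes. The sketch below builds a three-step rule (take, choose-leaf firstn, emit) in userspace. The `demo_*` types are hypothetical copies of the structs above, the op values are taken from the enum above, and a C99 flexible array member stands in for `steps[0]`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct demo_rule_step { uint32_t op; int32_t arg1; int32_t arg2; };
struct demo_rule_mask { uint8_t ruleset, type, min_size, max_size; };
struct demo_rule {
	uint32_t len;
	struct demo_rule_mask mask;
	struct demo_rule_step steps[];	/* C99 flexible array member */
};

#define demo_rule_size(len) \
	(sizeof(struct demo_rule) + (len) * sizeof(struct demo_rule_step))

/* Values copied from the step op enum above. */
enum { DEMO_TAKE = 1, DEMO_EMIT = 4, DEMO_CHOOSE_LEAF_FIRSTN = 6 };

/* take root; choose-leaf firstn N of leaf_type; emit. Caller frees. */
static struct demo_rule *demo_make_rule(int root_id, int leaf_type)
{
	struct demo_rule *r = calloc(1, demo_rule_size(3));

	if (!r)
		return NULL;
	r->len = 3;
	r->mask.min_size = 1;
	r->mask.max_size = 10;
	r->steps[0] = (struct demo_rule_step){ DEMO_TAKE, root_id, 0 };
	/* arg1 = 0 means "as many as result_max" */
	r->steps[1] = (struct demo_rule_step){ DEMO_CHOOSE_LEAF_FIRSTN, 0, leaf_type };
	r->steps[2] = (struct demo_rule_step){ DEMO_EMIT, 0, 0 };
	return r;
}
```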
|
||||
|
||||
|
||||
|
||||
/*
|
||||
* A bucket is a named container of other items (either devices or
|
||||
* other buckets). Items within a bucket are chosen using one of a
|
||||
* few different algorithms. The table summarizes how the speed of
|
||||
* each option measures up against mapping stability when items are
|
||||
* added or removed.
|
||||
*
|
||||
* Bucket Alg     Speed       Additions    Removals
* ------------------------------------------------
* uniform         O(1)       poor         poor
* list            O(n)       optimal      poor
* tree            O(log n)   good         good
* straw           O(n)       optimal      optimal
|
||||
*/
|
||||
enum {
|
||||
CRUSH_BUCKET_UNIFORM = 1,
|
||||
CRUSH_BUCKET_LIST = 2,
|
||||
CRUSH_BUCKET_TREE = 3,
|
||||
CRUSH_BUCKET_STRAW = 4
|
||||
};
|
||||
extern const char *crush_bucket_alg_name(int alg);
|
||||
|
||||
struct crush_bucket {
|
||||
__s32 id; /* this'll be negative */
|
||||
__u16 type; /* non-zero; type=0 is reserved for devices */
|
||||
__u8 alg; /* one of CRUSH_BUCKET_* */
|
||||
__u8 hash; /* which hash function to use, CRUSH_HASH_* */
|
||||
__u32 weight; /* 16-bit fixed point */
|
||||
__u32 size; /* num items */
|
||||
__s32 *items;
|
||||
|
||||
/*
|
||||
* cached random permutation: used for uniform bucket and for
|
||||
* the linear search fallback for the other bucket types.
|
||||
*/
|
||||
__u32 perm_x; /* @x for which *perm is defined */
|
||||
__u32 perm_n; /* num elements of *perm that are permuted/defined */
|
||||
__u32 *perm;
|
||||
};
|
||||
|
||||
struct crush_bucket_uniform {
|
||||
struct crush_bucket h;
|
||||
__u32 item_weight; /* 16-bit fixed point; all items equally weighted */
|
||||
};
|
||||
|
||||
struct crush_bucket_list {
|
||||
struct crush_bucket h;
|
||||
__u32 *item_weights; /* 16-bit fixed point */
|
||||
__u32 *sum_weights; /* 16-bit fixed point. element i is sum
|
||||
of weights 0..i, inclusive */
|
||||
};
|
||||
|
||||
struct crush_bucket_tree {
|
||||
struct crush_bucket h; /* note: h.size is _tree_ size, not number of
|
||||
actual items */
|
||||
__u8 num_nodes;
|
||||
__u32 *node_weights;
|
||||
};
|
||||
|
||||
struct crush_bucket_straw {
|
||||
struct crush_bucket h;
|
||||
__u32 *item_weights; /* 16-bit fixed point */
|
||||
__u32 *straws; /* 16-bit fixed point */
|
||||
};
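The weights in these structs are annotated as "16-bit fixed point". A small conversion sketch follows, assuming that this means 16 fractional bits so that `0x10000` encodes 1.0. That reading is an assumption, though it is consistent with the mapper comparing weights against draws masked with `0xffff`. `DEMO_WEIGHT_ONE` and both helpers are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_WEIGHT_ONE 0x10000u	/* assumed encoding of weight 1.0 */

/* Convert a non-negative weight to fixed point, rounding to nearest. */
static uint32_t demo_weight_from_double(double w)
{
	assert(w >= 0.0 && w <= 65535.0);
	return (uint32_t)(w * DEMO_WEIGHT_ONE + 0.5);
}

/* Convert a fixed-point weight back to a double. */
static double demo_weight_to_double(uint32_t w)
{
	return (double)w / DEMO_WEIGHT_ONE;
}
```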
|
||||
|
||||
|
||||
|
||||
/*
|
||||
* CRUSH map includes all buckets, rules, etc.
|
||||
*/
|
||||
struct crush_map {
|
||||
struct crush_bucket **buckets;
|
||||
struct crush_rule **rules;
|
||||
|
||||
/*
|
||||
* Parent pointers to identify the parent bucket of a device or
|
||||
* bucket in the hierarchy. If an item appears more than
|
||||
* once, this is the _last_ time it appeared (where buckets
|
||||
* are processed in bucket id order, from -1 on down to
|
||||
* -max_buckets).
|
||||
*/
|
||||
__u32 *bucket_parents;
|
||||
__u32 *device_parents;
|
||||
|
||||
__s32 max_buckets;
|
||||
__u32 max_rules;
|
||||
__s32 max_devices;
|
||||
};
|
||||
|
||||
|
||||
/* crush.c */
|
||||
extern int crush_get_bucket_item_weight(struct crush_bucket *b, int pos);
|
||||
extern void crush_calc_parents(struct crush_map *map);
|
||||
extern void crush_destroy_bucket_uniform(struct crush_bucket_uniform *b);
|
||||
extern void crush_destroy_bucket_list(struct crush_bucket_list *b);
|
||||
extern void crush_destroy_bucket_tree(struct crush_bucket_tree *b);
|
||||
extern void crush_destroy_bucket_straw(struct crush_bucket_straw *b);
|
||||
extern void crush_destroy_bucket(struct crush_bucket *b);
|
||||
extern void crush_destroy(struct crush_map *map);
|
||||
|
||||
#endif
|
149
fs/ceph/crush/hash.c
Normal file
@ -0,0 +1,149 @@
|
||||
|
||||
#include <linux/types.h>
|
||||
#include "hash.h"
|
||||
|
||||
/*
|
||||
* Robert Jenkins' function for mixing 32-bit values
|
||||
* http://burtleburtle.net/bob/hash/evahash.html
|
||||
* a, b = random bits, c = input and output
|
||||
*/
|
||||
#define crush_hashmix(a, b, c) do { \
|
||||
a = a-b; a = a-c; a = a^(c>>13); \
|
||||
b = b-c; b = b-a; b = b^(a<<8); \
|
||||
c = c-a; c = c-b; c = c^(b>>13); \
|
||||
a = a-b; a = a-c; a = a^(c>>12); \
|
||||
b = b-c; b = b-a; b = b^(a<<16); \
|
||||
c = c-a; c = c-b; c = c^(b>>5); \
|
||||
a = a-b; a = a-c; a = a^(c>>3); \
|
||||
b = b-c; b = b-a; b = b^(a<<10); \
|
||||
c = c-a; c = c-b; c = c^(b>>15); \
|
||||
} while (0)
|
||||
|
||||
#define crush_hash_seed 1315423911
|
||||
|
||||
static __u32 crush_hash32_rjenkins1(__u32 a)
|
||||
{
|
||||
__u32 hash = crush_hash_seed ^ a;
|
||||
__u32 b = a;
|
||||
__u32 x = 231232;
|
||||
__u32 y = 1232;
|
||||
crush_hashmix(b, x, hash);
|
||||
crush_hashmix(y, a, hash);
|
||||
return hash;
|
||||
}
|
||||
|
||||
static __u32 crush_hash32_rjenkins1_2(__u32 a, __u32 b)
|
||||
{
|
||||
__u32 hash = crush_hash_seed ^ a ^ b;
|
||||
__u32 x = 231232;
|
||||
__u32 y = 1232;
|
||||
crush_hashmix(a, b, hash);
|
||||
crush_hashmix(x, a, hash);
|
||||
crush_hashmix(b, y, hash);
|
||||
return hash;
|
||||
}
|
||||
|
||||
static __u32 crush_hash32_rjenkins1_3(__u32 a, __u32 b, __u32 c)
|
||||
{
|
||||
__u32 hash = crush_hash_seed ^ a ^ b ^ c;
|
||||
__u32 x = 231232;
|
||||
__u32 y = 1232;
|
||||
crush_hashmix(a, b, hash);
|
||||
crush_hashmix(c, x, hash);
|
||||
crush_hashmix(y, a, hash);
|
||||
crush_hashmix(b, x, hash);
|
||||
crush_hashmix(y, c, hash);
|
||||
return hash;
|
||||
}
|
||||
|
||||
static __u32 crush_hash32_rjenkins1_4(__u32 a, __u32 b, __u32 c, __u32 d)
|
||||
{
|
||||
__u32 hash = crush_hash_seed ^ a ^ b ^ c ^ d;
|
||||
__u32 x = 231232;
|
||||
__u32 y = 1232;
|
||||
crush_hashmix(a, b, hash);
|
||||
crush_hashmix(c, d, hash);
|
||||
crush_hashmix(a, x, hash);
|
||||
crush_hashmix(y, b, hash);
|
||||
crush_hashmix(c, x, hash);
|
||||
crush_hashmix(y, d, hash);
|
||||
return hash;
|
||||
}
|
||||
|
||||
static __u32 crush_hash32_rjenkins1_5(__u32 a, __u32 b, __u32 c, __u32 d,
|
||||
__u32 e)
|
||||
{
|
||||
__u32 hash = crush_hash_seed ^ a ^ b ^ c ^ d ^ e;
|
||||
__u32 x = 231232;
|
||||
__u32 y = 1232;
|
||||
crush_hashmix(a, b, hash);
|
||||
crush_hashmix(c, d, hash);
|
||||
crush_hashmix(e, x, hash);
|
||||
crush_hashmix(y, a, hash);
|
||||
crush_hashmix(b, x, hash);
|
||||
crush_hashmix(y, c, hash);
|
||||
crush_hashmix(d, x, hash);
|
||||
crush_hashmix(y, e, hash);
|
||||
return hash;
|
||||
}
|
||||
|
||||
|
||||
__u32 crush_hash32(int type, __u32 a)
|
||||
{
|
||||
switch (type) {
|
||||
case CRUSH_HASH_RJENKINS1:
|
||||
return crush_hash32_rjenkins1(a);
|
||||
default:
|
||||
return 0;
|
||||
}
|
||||
}
|
||||
|
||||
__u32 crush_hash32_2(int type, __u32 a, __u32 b)
|
||||
{
|
||||
switch (type) {
|
||||
case CRUSH_HASH_RJENKINS1:
|
||||
return crush_hash32_rjenkins1_2(a, b);
|
||||
default:
|
||||
return 0;
|
||||
}
|
||||
}
|
||||
|
||||
__u32 crush_hash32_3(int type, __u32 a, __u32 b, __u32 c)
|
||||
{
|
||||
switch (type) {
|
||||
case CRUSH_HASH_RJENKINS1:
|
||||
return crush_hash32_rjenkins1_3(a, b, c);
|
||||
default:
|
||||
return 0;
|
||||
}
|
||||
}
|
||||
|
||||
__u32 crush_hash32_4(int type, __u32 a, __u32 b, __u32 c, __u32 d)
|
||||
{
|
||||
switch (type) {
|
||||
case CRUSH_HASH_RJENKINS1:
|
||||
return crush_hash32_rjenkins1_4(a, b, c, d);
|
||||
default:
|
||||
return 0;
|
||||
}
|
||||
}
|
||||
|
||||
__u32 crush_hash32_5(int type, __u32 a, __u32 b, __u32 c, __u32 d, __u32 e)
|
||||
{
|
||||
switch (type) {
|
||||
case CRUSH_HASH_RJENKINS1:
|
||||
return crush_hash32_rjenkins1_5(a, b, c, d, e);
|
||||
default:
|
||||
return 0;
|
||||
}
|
||||
}
|
||||
|
||||
const char *crush_hash_name(int type)
|
||||
{
|
||||
switch (type) {
|
||||
case CRUSH_HASH_RJENKINS1:
|
||||
return "rjenkins1";
|
||||
default:
|
||||
return "unknown";
|
||||
}
|
||||
}
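For experimenting outside the kernel, the mixing macro can be written as a function and combined the same way as `crush_hash32_rjenkins1_2()`. The sketch below is a userspace transcription with `uint32_t` in place of `__u32`. The `demo_*` names are hypothetical, and `demo_pick_slot()` only illustrates the "hash modulo bucket size" idiom the mapper uses; it is not a function from this commit.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_HASH_SEED 1315423911u

/* Function form of crush_hashmix(): a, b are scratch, c carries the hash. */
static void demo_hashmix(uint32_t *a, uint32_t *b, uint32_t *c)
{
	*a -= *b; *a -= *c; *a ^= *c >> 13;
	*b -= *c; *b -= *a; *b ^= *a << 8;
	*c -= *a; *c -= *b; *c ^= *b >> 13;
	*a -= *b; *a -= *c; *a ^= *c >> 12;
	*b -= *c; *b -= *a; *b ^= *a << 16;
	*c -= *a; *c -= *b; *c ^= *b >> 5;
	*a -= *b; *a -= *c; *a ^= *c >> 3;
	*b -= *c; *b -= *a; *b ^= *a << 10;
	*c -= *a; *c -= *b; *c ^= *b >> 15;
}

/* Two-input hash, following the structure of crush_hash32_rjenkins1_2(). */
static uint32_t demo_hash32_2(uint32_t a, uint32_t b)
{
	uint32_t hash = DEMO_HASH_SEED ^ a ^ b;
	uint32_t x = 231232;
	uint32_t y = 1232;

	demo_hashmix(&a, &b, &hash);
	demo_hashmix(&x, &a, &hash);
	demo_hashmix(&b, &y, &hash);
	return hash;
}

/* Map (input, bucket id) onto one of n slots; n must be non-zero. */
static uint32_t demo_pick_slot(uint32_t x, uint32_t bucket_id, uint32_t n)
{
	assert(n > 0);
	return demo_hash32_2(x, bucket_id) % n;
}
```

The property that matters for placement is determinism: the same inputs always give the same slot, so every client computes the same mapping without coordination.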
|
17
fs/ceph/crush/hash.h
Normal file
@ -0,0 +1,17 @@
|
||||
#ifndef _CRUSH_HASH_H
|
||||
#define _CRUSH_HASH_H
|
||||
|
||||
#define CRUSH_HASH_RJENKINS1 0
|
||||
|
||||
#define CRUSH_HASH_DEFAULT CRUSH_HASH_RJENKINS1
|
||||
|
||||
extern const char *crush_hash_name(int type);
|
||||
|
||||
extern __u32 crush_hash32(int type, __u32 a);
|
||||
extern __u32 crush_hash32_2(int type, __u32 a, __u32 b);
|
||||
extern __u32 crush_hash32_3(int type, __u32 a, __u32 b, __u32 c);
|
||||
extern __u32 crush_hash32_4(int type, __u32 a, __u32 b, __u32 c, __u32 d);
|
||||
extern __u32 crush_hash32_5(int type, __u32 a, __u32 b, __u32 c, __u32 d,
|
||||
__u32 e);
|
||||
|
||||
#endif
|
596
fs/ceph/crush/mapper.c
Normal file
@ -0,0 +1,596 @@
|
||||
|
||||
#ifdef __KERNEL__
|
||||
# include <linux/string.h>
|
||||
# include <linux/slab.h>
|
||||
# include <linux/bug.h>
|
||||
# include <linux/kernel.h>
|
||||
# ifndef dprintk
|
||||
# define dprintk(args...)
|
||||
# endif
|
||||
#else
|
||||
# include <string.h>
|
||||
# include <stdio.h>
|
||||
# include <stdlib.h>
|
||||
# include <assert.h>
|
||||
# define BUG_ON(x) assert(!(x))
|
||||
# define dprintk(args...) /* printf(args) */
|
||||
# define kmalloc(x, f) malloc(x)
|
||||
# define kfree(x) free(x)
|
||||
#endif
|
||||
|
||||
#include "crush.h"
|
||||
#include "hash.h"
|
||||
|
||||
/*
|
||||
* Implement the core CRUSH mapping algorithm.
|
||||
*/
|
||||
|
||||
/**
|
||||
* crush_find_rule - find a crush_rule id for a given ruleset, type, and size.
|
||||
* @map: the crush_map
|
||||
* @ruleset: the storage ruleset id (user defined)
|
||||
* @type: storage ruleset type (user defined)
|
||||
* @size: output set size
|
||||
*/
|
||||
int crush_find_rule(struct crush_map *map, int ruleset, int type, int size)
|
||||
{
|
||||
int i;
|
||||
|
||||
for (i = 0; i < map->max_rules; i++) {
|
||||
if (map->rules[i] &&
|
||||
map->rules[i]->mask.ruleset == ruleset &&
|
||||
map->rules[i]->mask.type == type &&
|
||||
map->rules[i]->mask.min_size <= size &&
|
||||
map->rules[i]->mask.max_size >= size)
|
||||
return i;
|
||||
}
|
||||
return -1;
|
||||
}
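The rule lookup above is a linear scan for the first rule whose mask matches the ruleset and type and whose size range contains the requested size. A self-contained sketch with a made-up rule table follows; `struct demo_mask` and the `demo_*` functions are hypothetical simplifications.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct demo_mask { uint8_t ruleset, type, min_size, max_size; };

/* Returns the index of the first matching mask, or -1 if none matches. */
static int demo_find_rule(const struct demo_mask **masks, int max_rules,
			  int ruleset, int type, int size)
{
	for (int i = 0; i < max_rules; i++) {
		if (masks[i] &&
		    masks[i]->ruleset == ruleset &&
		    masks[i]->type == type &&
		    masks[i]->min_size <= size &&
		    masks[i]->max_size >= size)
			return i;
	}
	return -1;
}

/* Example table: slot 0 covers sizes 1-2, slot 1 is empty, slot 2 covers 3-10. */
static int demo_lookup(int ruleset, int type, int size)
{
	static const struct demo_mask small = { 0, 1, 1, 2 };
	static const struct demo_mask large = { 0, 1, 3, 10 };
	const struct demo_mask *masks[] = { &small, NULL, &large };

	return demo_find_rule(masks, 3, ruleset, type, size);
}
```

Empty slots are skipped rather than treated as an error, so rule ids can be sparse.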
|
||||
|
||||
|
||||
/*
|
||||
* bucket choose methods
|
||||
*
|
||||
* For each bucket algorithm, we have a "choose" method that, given a
|
||||
* crush input @x and replica position (usually, position in output set) @r,
|
||||
* will produce an item in the bucket.
|
||||
*/
|
||||
|
||||
/*
|
||||
* Choose based on a random permutation of the bucket.
|
||||
*
|
||||
* We used to use some prime number arithmetic to do this, but it
|
||||
* wasn't very random, and had some other bad behaviors. Instead, we
|
||||
* calculate an actual random permutation of the bucket members.
|
||||
* Since this is expensive, we optimize for the r=0 case, which
|
||||
* captures the vast majority of calls.
|
||||
*/
|
||||
static int bucket_perm_choose(struct crush_bucket *bucket,
|
||||
int x, int r)
|
||||
{
|
||||
unsigned pr = r % bucket->size;
|
||||
unsigned i, s;
|
||||
|
||||
/* start a new permutation if @x has changed */
|
||||
if (bucket->perm_x != x || bucket->perm_n == 0) {
|
||||
dprintk("bucket %d new x=%d\n", bucket->id, x);
|
||||
bucket->perm_x = x;
|
||||
|
||||
/* optimize common r=0 case */
|
||||
if (pr == 0) {
|
||||
s = crush_hash32_3(bucket->hash, x, bucket->id, 0) %
|
||||
bucket->size;
|
||||
bucket->perm[0] = s;
|
||||
bucket->perm_n = 0xffff; /* magic value, see below */
|
||||
goto out;
|
||||
}
|
||||
|
||||
for (i = 0; i < bucket->size; i++)
|
||||
bucket->perm[i] = i;
|
||||
bucket->perm_n = 0;
|
||||
} else if (bucket->perm_n == 0xffff) {
|
||||
/* clean up after the r=0 case above */
|
||||
for (i = 1; i < bucket->size; i++)
|
||||
bucket->perm[i] = i;
|
||||
bucket->perm[bucket->perm[0]] = 0;
|
||||
bucket->perm_n = 1;
|
||||
}
|
||||
|
||||
/* calculate permutation up to pr */
|
||||
for (i = 0; i < bucket->perm_n; i++)
|
||||
dprintk(" perm_choose have %d: %d\n", i, bucket->perm[i]);
|
||||
while (bucket->perm_n <= pr) {
|
||||
unsigned p = bucket->perm_n;
|
||||
/* no point in swapping the final entry */
|
||||
if (p < bucket->size - 1) {
|
||||
i = crush_hash32_3(bucket->hash, x, bucket->id, p) %
|
||||
(bucket->size - p);
|
||||
if (i) {
|
||||
unsigned t = bucket->perm[p + i];
|
||||
bucket->perm[p + i] = bucket->perm[p];
|
||||
bucket->perm[p] = t;
|
||||
}
|
||||
dprintk(" perm_choose swap %d with %d\n", p, p+i);
|
||||
}
|
||||
bucket->perm_n++;
|
||||
}
|
||||
for (i = 0; i < bucket->size; i++)
|
||||
dprintk(" perm_choose %d: %d\n", i, bucket->perm[i]);
|
||||
|
||||
s = bucket->perm[pr];
|
||||
out:
|
||||
dprintk(" perm_choose %d sz=%d x=%d r=%d (%d) s=%d\n", bucket->id,
|
||||
bucket->size, x, r, pr, s);
|
||||
return bucket->items[s];
|
||||
}
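The core of `bucket_perm_choose()` is an incremental Fisher-Yates shuffle: entry `p` is fixed by swapping it with a hashed choice among the entries from `p` onward, and the shuffle is only extended as far as the requested position. The sketch below isolates that loop. `demo_hash()` is a hypothetical stand-in mixer, not `crush_hash32_3()`, and the `r = 0` shortcut with its `0xffff` marker is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for crush_hash32_3(); not the real hash. */
static uint32_t demo_hash(uint32_t x, uint32_t id, uint32_t p)
{
	uint32_t h = (x * 2654435761u) ^ (id * 40503u) ^ (p * 2246822519u);

	h ^= h >> 15;
	return h;
}

/* Finalize perm[0..upto], resuming from the *perm_n entries already fixed. */
static void demo_perm_extend(uint32_t *perm, unsigned size, unsigned *perm_n,
			     uint32_t x, uint32_t id, unsigned upto)
{
	while (*perm_n <= upto) {
		unsigned p = *perm_n;

		/* no point in swapping the final entry */
		if (p < size - 1) {
			unsigned i = demo_hash(x, id, p) % (size - p);

			if (i) {
				uint32_t t = perm[p + i];

				perm[p + i] = perm[p];
				perm[p] = t;
			}
		}
		(*perm_n)++;
	}
}

/* Returns 1 if a fully extended perm[] holds each index exactly once. */
static int demo_perm_is_valid(unsigned size, uint32_t x)
{
	uint32_t perm[16];
	unsigned seen = 0, perm_n = 0, i;

	assert(size >= 1 && size <= 16);
	for (i = 0; i < size; i++)
		perm[i] = i;
	demo_perm_extend(perm, size, &perm_n, x, 7, size - 1);
	for (i = 0; i < size; i++) {
		if (perm[i] >= size || (seen & (1u << perm[i])))
			return 0;
		seen |= 1u << perm[i];
	}
	return 1;
}
```

Because only swaps are performed, the array stays a permutation at every step, which is what lets the fallback search in `crush_choose()` visit every item in the bucket.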
|
||||
|
||||
/* uniform */
|
||||
static int bucket_uniform_choose(struct crush_bucket_uniform *bucket,
|
||||
int x, int r)
|
||||
{
|
||||
return bucket_perm_choose(&bucket->h, x, r);
|
||||
}
|
||||
|
||||
/* list */
|
||||
static int bucket_list_choose(struct crush_bucket_list *bucket,
|
||||
int x, int r)
|
||||
{
|
||||
int i;
|
||||
|
||||
for (i = bucket->h.size-1; i >= 0; i--) {
|
||||
__u64 w = crush_hash32_4(bucket->h.hash, x, bucket->h.items[i],
|
||||
r, bucket->h.id);
|
||||
w &= 0xffff;
|
||||
dprintk("list_choose i=%d x=%d r=%d item %d weight %x "
|
||||
"sw %x rand %llx",
|
||||
i, x, r, bucket->h.items[i], bucket->item_weights[i],
|
||||
bucket->sum_weights[i], w);
|
||||
w *= bucket->sum_weights[i];
|
||||
w = w >> 16;
|
||||
/*dprintk(" scaled %llx\n", w);*/
|
||||
if (w < bucket->item_weights[i])
|
||||
return bucket->h.items[i];
|
||||
}
|
||||
|
||||
BUG_ON(1);
|
||||
return 0;
|
||||
}
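The list walk can be followed by hand if the hash is replaced with explicit 16-bit draws. The sketch below keeps the arithmetic of `bucket_list_choose()` (scale the draw by the running sum, accept if it falls below the item's own weight) but returns the chosen index instead of the item id. The `demo_*` names and the sample weights are hypothetical, with `0x10000` assumed to mean weight 1.0.

```c
#include <assert.h>
#include <stdint.h>

/* Walk from the tail; draws[i] plays the role of (hash & 0xffff) for item i. */
static int demo_list_choose(const uint32_t *item_weights,
			    const uint32_t *sum_weights, int size,
			    const uint16_t *draws)
{
	for (int i = size - 1; i >= 0; i--) {
		uint64_t w = draws[i];

		w *= sum_weights[i];	/* scale draw to [0, sum of weights 0..i) */
		w >>= 16;
		if (w < item_weights[i])
			return i;
	}
	return -1;	/* unreachable when item 0 has non-zero weight */
}

/* Items weighted 1.0, 1.0, 2.0; running sums 1.0, 2.0, 4.0. */
static int demo_list_pick(uint16_t d0, uint16_t d1, uint16_t d2)
{
	static const uint32_t item_w[] = { 0x10000, 0x10000, 0x20000 };
	static const uint32_t sum_w[] = { 0x10000, 0x20000, 0x40000 };
	const uint16_t draws[] = { d0, d1, d2 };

	return demo_list_choose(item_w, sum_w, 3, draws);
}
```

Item `i` is accepted with probability `item_weights[i] / sum_weights[i]`, so the head item is always accepted if the walk reaches it. Appending a new item at the tail only steals mappings for the new item, which is why the table in crush.h rates list buckets as optimal for additions.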
|
||||
|
||||
|
||||
/* (binary) tree */
|
||||
static int height(int n)
|
||||
{
|
||||
int h = 0;
|
||||
while ((n & 1) == 0) {
|
||||
h++;
|
||||
n = n >> 1;
|
||||
}
|
||||
return h;
|
||||
}
|
||||
|
||||
static int left(int x)
|
||||
{
|
||||
int h = height(x);
|
||||
return x - (1 << (h-1));
|
||||
}
|
||||
|
||||
static int right(int x)
|
||||
{
|
||||
int h = height(x);
|
||||
return x + (1 << (h-1));
|
||||
}
|
||||
|
||||
static int terminal(int x)
|
||||
{
|
||||
return x & 1;
|
||||
}
|
||||
|
||||
static int bucket_tree_choose(struct crush_bucket_tree *bucket,
|
||||
int x, int r)
|
||||
{
|
||||
int n, l;
|
||||
__u32 w;
|
||||
__u64 t;
|
||||
|
||||
/* start at root */
|
||||
n = bucket->num_nodes >> 1;
|
||||
|
||||
while (!terminal(n)) {
|
||||
/* pick point in [0, w) */
|
||||
w = bucket->node_weights[n];
|
||||
t = (__u64)crush_hash32_4(bucket->h.hash, x, n, r,
|
||||
bucket->h.id) * (__u64)w;
|
||||
t = t >> 32;
|
||||
|
||||
/* descend to the left or right? */
|
||||
l = left(n);
|
||||
if (t < bucket->node_weights[l])
|
||||
n = l;
|
||||
else
|
||||
n = right(n);
|
||||
}
|
||||
|
||||
return bucket->h.items[n >> 1];
|
||||
}
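The node numbering behind `height()`, `left()`, `right()` and `terminal()` is easiest to see on a small tree: with `num_nodes = 8` the root is node 4, interior nodes are even, leaves are odd, and leaf `n` holds item `n >> 1`. The sketch below reuses that arithmetic with caller-supplied 32-bit draws in place of the hash. The `demo_*` names and the four equally weighted items are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of trailing zero bits; n must be non-zero. */
static int demo_height(int n)
{
	int h = 0;

	assert(n > 0);
	while ((n & 1) == 0) {
		h++;
		n >>= 1;
	}
	return h;
}

static int demo_left(int x)  { return x - (1 << (demo_height(x) - 1)); }
static int demo_right(int x) { return x + (1 << (demo_height(x) - 1)); }
static int demo_terminal(int x) { return x & 1; }

/* Descend from the root; draws[] supplies one value per interior node visited. */
static int demo_tree_choose(const uint32_t *node_weights, int num_nodes,
			    const uint32_t *draws)
{
	int n = num_nodes >> 1;	/* start at root */
	int step = 0;

	while (!demo_terminal(n)) {
		/* pick a point in [0, w) */
		uint64_t t = (uint64_t)draws[step++] * node_weights[n];
		int l = demo_left(n);

		t >>= 32;
		if (t < node_weights[l])
			n = l;
		else
			n = demo_right(n);
	}
	return n >> 1;	/* index into the items array */
}

/* Four items of weight 1.0 each: leaves 1, 3, 5, 7; interior nodes 2, 6; root 4. */
static int demo_tree_pick(uint32_t d0, uint32_t d1)
{
	static const uint32_t node_w[8] = {
		0, 0x10000, 0x20000, 0x10000, 0x40000, 0x10000, 0x20000, 0x10000
	};
	const uint32_t draws[] = { d0, d1 };

	return demo_tree_choose(node_w, 8, draws);
}
```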
|
||||
|
||||
|
||||
/* straw */
|
||||
|
||||
static int bucket_straw_choose(struct crush_bucket_straw *bucket,
|
||||
int x, int r)
|
||||
{
|
||||
int i;
|
||||
int high = 0;
|
||||
__u64 high_draw = 0;
|
||||
__u64 draw;
|
||||
|
||||
for (i = 0; i < bucket->h.size; i++) {
|
||||
draw = crush_hash32_3(bucket->h.hash, x, bucket->h.items[i], r);
|
||||
draw &= 0xffff;
|
||||
draw *= bucket->straws[i];
|
||||
if (i == 0 || draw > high_draw) {
|
||||
high = i;
|
||||
high_draw = draw;
|
||||
}
|
||||
}
|
||||
return bucket->h.items[high];
|
||||
}
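Straw selection is a weighted lottery: every item draws a 16-bit value, the draw is multiplied by the item's straw length, and the longest result wins. The sketch below substitutes explicit draws for the hash and returns the winning index. The `demo_*` names and straw lengths are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* draws[i] plays the role of (hash & 0xffff) for item i. */
static int demo_straw_choose(const uint32_t *straws, int size,
			     const uint16_t *draws)
{
	int high = 0;
	uint64_t high_draw = 0;

	for (int i = 0; i < size; i++) {
		uint64_t draw = draws[i];

		draw *= straws[i];
		if (i == 0 || draw > high_draw) {
			high = i;
			high_draw = draw;
		}
	}
	return high;
}

/* Three items; the third has a straw twice as long as the others. */
static int demo_straw_pick(uint16_t d0, uint16_t d1, uint16_t d2)
{
	static const uint32_t straws[] = { 0x10000, 0x10000, 0x20000 };
	const uint16_t draws[] = { d0, d1, d2 };

	return demo_straw_choose(straws, 3, draws);
}
```

Each item's draw depends only on its own id and straw, so adding or removing an item never reshuffles the contest among the others. The comparison is strict, so a tie goes to the earlier item.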
|
||||
|
||||
static int crush_bucket_choose(struct crush_bucket *in, int x, int r)
|
||||
{
|
||||
dprintk("choose %d x=%d r=%d\n", in->id, x, r);
|
||||
switch (in->alg) {
|
||||
case CRUSH_BUCKET_UNIFORM:
|
||||
return bucket_uniform_choose((struct crush_bucket_uniform *)in,
|
||||
x, r);
|
||||
case CRUSH_BUCKET_LIST:
|
||||
return bucket_list_choose((struct crush_bucket_list *)in,
|
||||
x, r);
|
||||
case CRUSH_BUCKET_TREE:
|
||||
return bucket_tree_choose((struct crush_bucket_tree *)in,
|
||||
x, r);
|
||||
case CRUSH_BUCKET_STRAW:
|
||||
return bucket_straw_choose((struct crush_bucket_straw *)in,
|
||||
x, r);
|
||||
default:
|
||||
BUG_ON(1);
|
||||
return in->items[0];
|
||||
}
|
||||
}
|
||||
|
||||
/*
|
||||
* true if device is marked "out" (failed, fully offloaded)
|
||||
* of the cluster
|
||||
*/
|
||||
static int is_out(struct crush_map *map, __u32 *weight, int item, int x)
|
||||
{
|
||||
if (weight[item] >= 0x10000)
|
||||
return 0;
|
||||
if (weight[item] == 0)
|
||||
return 1;
|
||||
if ((crush_hash32_2(CRUSH_HASH_RJENKINS1, x, item) & 0xffff)
|
||||
< weight[item])
|
||||
return 0;
|
||||
return 1;
|
||||
}
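The three-way decision in `is_out()` can be sketched with the hash replaced by an explicit 16-bit draw. This assumes a full weight of 1.0 is encoded as `0x10000`, which is what the `& 0xffff` range of the draw suggests. `DEMO_WEIGHT_ONE` and `demo_is_out()` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_WEIGHT_ONE 0x10000u	/* assumed encoding of weight 1.0 */

/* draw plays the role of (hash(x, item) & 0xffff). Returns 1 if the item is out. */
static int demo_is_out(uint32_t weight, uint16_t draw)
{
	if (weight >= DEMO_WEIGHT_ONE)
		return 0;	/* fully in: never rejected */
	if (weight == 0)
		return 1;	/* fully out: always rejected */
	if (draw < weight)
		return 0;	/* partially in: kept for this input */
	return 1;
}
```

A device at weight `0x8000` therefore keeps roughly half of the inputs that map to it, and because the draw is a function of the input and the device id, the same inputs are kept on every evaluation.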
|
||||
|
||||
/**
|
||||
* crush_choose - choose numrep distinct items of given type
|
||||
* @map: the crush_map
|
||||
* @bucket: the bucket we are choosing an item from
|
||||
* @x: crush input value
|
||||
* @numrep: the number of items to choose
|
||||
* @type: the type of item to choose
|
||||
* @out: pointer to output vector
|
||||
* @outpos: our position in that vector
|
||||
* @firstn: true if choosing "first n" items, false if choosing "indep"
|
||||
* @recurse_to_leaf: true if we want one device under each item of given type
|
||||
* @out2: second output vector for leaf items (if @recurse_to_leaf)
|
||||
*/
|
||||
static int crush_choose(struct crush_map *map,
|
||||
struct crush_bucket *bucket,
|
||||
__u32 *weight,
|
||||
int x, int numrep, int type,
|
||||
int *out, int outpos,
|
||||
int firstn, int recurse_to_leaf,
|
||||
int *out2)
|
||||
{
|
||||
int rep;
|
||||
int ftotal, flocal;
|
||||
int retry_descent, retry_bucket, skip_rep;
|
||||
struct crush_bucket *in = bucket;
|
||||
int r;
|
||||
int i;
|
||||
int item = 0;
|
||||
int itemtype;
|
||||
int collide, reject;
|
||||
const int orig_tries = 5; /* attempts before we fall back to search */
|
||||
dprintk("choose bucket %d x %d outpos %d\n", bucket->id, x, outpos);
|
||||
|
||||
for (rep = outpos; rep < numrep; rep++) {
|
||||
/* keep trying until we get a non-out, non-colliding item */
|
||||
ftotal = 0;
|
||||
skip_rep = 0;
|
||||
do {
|
||||
retry_descent = 0;
|
||||
in = bucket; /* initial bucket */
|
||||
|
||||
/* choose through intervening buckets */
|
||||
flocal = 0;
|
||||
do {
|
||||
collide = 0;
|
||||
retry_bucket = 0;
|
||||
r = rep;
|
||||
if (in->alg == CRUSH_BUCKET_UNIFORM) {
|
||||
/* be careful */
|
||||
if (firstn || numrep >= in->size)
|
||||
/* r' = r + f_total */
|
||||
r += ftotal;
|
||||
else if (in->size % numrep == 0)
|
||||
/* r'=r+(n+1)*f_local */
|
||||
r += (numrep+1) *
|
||||
(flocal+ftotal);
|
||||
else
|
||||
/* r' = r + n*f_local */
|
||||
r += numrep * (flocal+ftotal);
|
||||
} else {
|
||||
if (firstn)
|
||||
/* r' = r + f_total */
|
||||
r += ftotal;
|
||||
else
|
||||
/* r' = r + n*f_local */
|
||||
r += numrep * (flocal+ftotal);
|
||||
}
|
||||
|
||||
/* bucket choose */
|
||||
if (in->size == 0) {
|
||||
reject = 1;
|
||||
goto reject;
|
||||
}
|
||||
if (flocal >= (in->size>>1) &&
|
||||
flocal > orig_tries)
|
||||
item = bucket_perm_choose(in, x, r);
|
||||
else
|
||||
item = crush_bucket_choose(in, x, r);
|
||||
BUG_ON(item >= map->max_devices);
|
||||
|
||||
/* desired type? */
|
||||
if (item < 0)
|
||||
itemtype = map->buckets[-1-item]->type;
|
||||
else
|
||||
itemtype = 0;
|
||||
dprintk(" item %d type %d\n", item, itemtype);
|
||||
|
||||
/* keep going? */
|
||||
if (itemtype != type) {
|
||||
BUG_ON(item >= 0 ||
|
||||
(-1-item) >= map->max_buckets);
|
||||
in = map->buckets[-1-item];
|
||||
continue;
|
||||
}
|
||||
|
||||
/* collision? */
|
||||
for (i = 0; i < outpos; i++) {
|
||||
if (out[i] == item) {
|
||||
collide = 1;
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
if (recurse_to_leaf &&
|
||||
item < 0 &&
|
||||
crush_choose(map, map->buckets[-1-item],
|
||||
weight,
|
||||
x, outpos+1, 0,
|
||||
out2, outpos,
|
||||
firstn, 0, NULL) <= outpos) {
|
||||
reject = 1;
|
||||
} else {
|
||||
/* out? */
|
||||
if (itemtype == 0)
|
||||
reject = is_out(map, weight,
|
||||
item, x);
|
||||
else
|
||||
reject = 0;
|
||||
}
|
||||
|
||||
reject:
|
||||
if (reject || collide) {
|
||||
ftotal++;
|
||||
flocal++;
|
||||
|
||||
if (collide && flocal < 3)
|
||||
/* retry locally a few times */
|
||||
retry_bucket = 1;
|
||||
else if (flocal < in->size + orig_tries)
|
||||
/* exhaustive bucket search */
|
||||
retry_bucket = 1;
|
||||
else if (ftotal < 20)
|
||||
/* then retry descent */
|
||||
retry_descent = 1;
|
||||
else
|
||||
/* else give up */
|
||||
skip_rep = 1;
|
||||
dprintk(" reject %d collide %d "
|
||||
"ftotal %d flocal %d\n",
|
||||
reject, collide, ftotal,
|
||||
flocal);
|
||||
}
|
||||
} while (retry_bucket);
|
||||
} while (retry_descent);
|
||||
|
||||
if (skip_rep) {
|
||||
dprintk("skip rep\n");
|
||||
continue;
|
||||
}
|
||||
|
||||
dprintk("choose got %d\n", item);
|
||||
out[outpos] = item;
|
||||
outpos++;
|
||||
}
|
||||
|
||||
dprintk("choose returns %d\n", outpos);
|
||||
return outpos;
|
||||
}
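The replica rank `r` that `crush_choose()` feeds to the bucket depends on the failure counters and on whether the step is "firstn" or "indep". The sketch below extracts just that calculation so the branches can be compared side by side. `demo_retry_r()` and its parameter names are hypothetical; the arithmetic follows the code above.

```c
#include <assert.h>

/*
 * rep:      replica position being filled
 * numrep:   number of replicas requested
 * ftotal:   failures so far for this replica
 * flocal:   failures so far in the current bucket
 * firstn:   non-zero for "firstn" steps, zero for "indep"
 * uniform:  non-zero if the current bucket is a uniform bucket
 * size:     number of items in the current bucket
 */
static int demo_retry_r(int rep, int numrep, int ftotal, int flocal,
			int firstn, int uniform, int size)
{
	int r = rep;

	if (uniform) {
		if (firstn || numrep >= size)
			r += ftotal;
		else if (size % numrep == 0)
			/* avoid revisiting the same slots when numrep divides size */
			r += (numrep + 1) * (flocal + ftotal);
		else
			r += numrep * (flocal + ftotal);
	} else {
		if (firstn)
			r += ftotal;
		else
			r += numrep * (flocal + ftotal);
	}
	return r;
}
```

In firstn mode a failure simply shifts to the next rank. In indep mode the retry jumps by a multiple of `numrep`, so a retry for one position does not land on a rank used by another position.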
|
||||
|
||||
|
||||
/**
|
||||
* crush_do_rule - calculate a mapping with the given input and rule
|
||||
* @map: the crush_map
|
||||
* @ruleno: the rule id
|
||||
* @x: hash input
|
||||
* @result: pointer to result vector
|
||||
* @result_max: maximum result size
|
||||
* @force: force initial replica choice; -1 for none
* @weight: per-device weight vector (consulted by is_out())
|
||||
*/
|
||||
int crush_do_rule(struct crush_map *map,
|
||||
int ruleno, int x, int *result, int result_max,
|
||||
int force, __u32 *weight)
|
||||
{
|
||||
int result_len;
|
||||
int force_context[CRUSH_MAX_DEPTH];
|
||||
int force_pos = -1;
|
||||
int a[CRUSH_MAX_SET];
|
||||
int b[CRUSH_MAX_SET];
|
||||
int c[CRUSH_MAX_SET];
|
||||
int recurse_to_leaf;
|
||||
int *w;
|
||||
int wsize = 0;
|
||||
int *o;
|
||||
int osize;
|
||||
int *tmp;
|
||||
struct crush_rule *rule;
|
||||
int step;
|
||||
int i, j;
|
||||
int numrep;
|
||||
int firstn;
|
||||
int rc = -1;
|
||||
|
||||
BUG_ON(ruleno >= map->max_rules);
|
||||
|
||||
rule = map->rules[ruleno];
|
||||
result_len = 0;
|
||||
w = a;
|
||||
o = b;
|
||||
|
||||
/*
|
||||
* determine hierarchical context of force, if any. note
|
||||
* that this may or may not correspond to the specific types
|
||||
* referenced by the crush rule.
|
||||
*/
|
||||
if (force >= 0) {
|
||||
if (force >= map->max_devices ||
|
||||
map->device_parents[force] == 0) {
|
||||
/*dprintk("CRUSH: forcefed device dne\n");*/
|
||||
rc = -1; /* force fed device dne */
|
||||
goto out;
|
||||
}
|
||||
if (!is_out(map, weight, force, x)) {
|
||||
while (1) {
|
||||
force_context[++force_pos] = force;
|
||||
if (force >= 0)
|
||||
force = map->device_parents[force];
|
||||
else
|
||||
force = map->bucket_parents[-1-force];
|
||||
if (force == 0)
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
for (step = 0; step < rule->len; step++) {
|
||||
firstn = 0;
|
||||
switch (rule->steps[step].op) {
|
||||
case CRUSH_RULE_TAKE:
|
||||
w[0] = rule->steps[step].arg1;
|
||||
if (force_pos >= 0) {
|
||||
BUG_ON(force_context[force_pos] != w[0]);
|
||||
force_pos--;
|
||||
}
|
||||
wsize = 1;
|
||||
break;
|
||||
|
||||
case CRUSH_RULE_CHOOSE_LEAF_FIRSTN:
|
||||
case CRUSH_RULE_CHOOSE_FIRSTN:
|
||||
firstn = 1;
/* fall through */
|
||||
case CRUSH_RULE_CHOOSE_LEAF_INDEP:
|
||||
case CRUSH_RULE_CHOOSE_INDEP:
|
||||
BUG_ON(wsize == 0);
|
||||
|
||||
recurse_to_leaf =
|
||||
rule->steps[step].op ==
|
||||
CRUSH_RULE_CHOOSE_LEAF_FIRSTN ||
|
||||
rule->steps[step].op ==
|
||||
CRUSH_RULE_CHOOSE_LEAF_INDEP;
|
||||
|
||||
/* reset output */
|
||||
osize = 0;
|
||||
|
||||
for (i = 0; i < wsize; i++) {
|
||||
/*
|
||||
* see CRUSH_CHOOSE_N, CRUSH_CHOOSE_N_MINUS macros.
|
||||
* basically, numrep <= 0 means relative to
|
||||
* the provided result_max
|
||||
*/
|
||||
numrep = rule->steps[step].arg1;
|
||||
if (numrep <= 0) {
|
||||
numrep += result_max;
|
||||
if (numrep <= 0)
|
||||
continue;
|
||||
}
|
||||
j = 0;
|
||||
if (osize == 0 && force_pos >= 0) {
|
||||
/* skip any intermediate types */
|
||||
while (force_pos &&
|
||||
force_context[force_pos] < 0 &&
|
||||
rule->steps[step].arg2 !=
|
||||
map->buckets[-1 -
|
||||
force_context[force_pos]]->type)
|
||||
force_pos--;
|
||||
o[osize] = force_context[force_pos];
|
||||
if (recurse_to_leaf)
|
||||
c[osize] = force_context[0];
|
||||
j++;
|
||||
force_pos--;
|
||||
}
|
||||
osize += crush_choose(map,
|
||||
map->buckets[-1-w[i]],
|
||||
weight,
|
||||
x, numrep,
|
||||
rule->steps[step].arg2,
|
||||
o+osize, j,
|
||||
firstn,
|
||||
recurse_to_leaf, c+osize);
|
||||
}
|
||||
|
||||
if (recurse_to_leaf)
|
||||
/* copy final _leaf_ values to output set */
|
||||
memcpy(o, c, osize*sizeof(*o));
|
||||
|
||||
/* swap t and w arrays */
|
||||
tmp = o;
|
||||
o = w;
|
||||
w = tmp;
|
||||
wsize = osize;
|
||||
break;
|
||||
|
||||
|
||||
case CRUSH_RULE_EMIT:
|
||||
for (i = 0; i < wsize && result_len < result_max; i++) {
|
||||
result[result_len] = w[i];
|
||||
result_len++;
|
||||
}
|
||||
wsize = 0;
|
||||
break;
|
||||
|
||||
default:
|
||||
BUG_ON(1);
|
||||
}
|
||||
}
|
||||
rc = result_len;
|
||||
|
||||
out:
|
||||
return rc;
|
||||
}
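
The choose steps in crush_do_rule treat a replica count of zero or less as an offset from result_max (see the CRUSH_N and CRUSH_N_MINUS comment in the loop). The following is a minimal user-space sketch of that rule only. The helper name resolve_numrep is hypothetical and does not exist in the CRUSH sources, and returning 0 stands in for the `continue` that skips the bucket.

```c
/*
 * Hypothetical helper, not part of fs/ceph/crush: resolve a rule
 * step's replica count the way the choose loop does.
 *
 *   arg1 >  0  -> use arg1 as given
 *   arg1 <= 0  -> interpret as result_max + arg1
 *
 * A resolved value that is still <= 0 means "select nothing"; the
 * kernel loop skips the input bucket in that case, which this
 * sketch reports as 0.
 */
static int resolve_numrep(int arg1, int result_max)
{
	int numrep = arg1;

	if (numrep <= 0) {
		numrep += result_max;
		if (numrep <= 0)
			return 0;
	}
	return numrep;
}
```

With result_max of 3, an arg1 of 0 resolves to 3 replicas and an arg1 of -1 resolves to 2, while an arg1 of -3 or lower selects nothing.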
20
fs/ceph/crush/mapper.h
Normal file
@ -0,0 +1,20 @@
#ifndef _CRUSH_MAPPER_H
#define _CRUSH_MAPPER_H

/*
 * CRUSH functions for finding rules and then mapping an input to an
 * output set.
 *
 * LGPL2
 */

#include "crush.h"

extern int crush_find_rule(struct crush_map *map, int pool, int type, int size);
extern int crush_do_rule(struct crush_map *map,
			 int ruleno,
			 int x, int *result, int result_max,
			 int forcefeed,    /* -1 for none */
			 __u32 *weights);

#endif
408
fs/ceph/crypto.c
Normal file
@ -0,0 +1,408 @@
#include "ceph_debug.h"

#include <linux/err.h>
#include <linux/scatterlist.h>
#include <crypto/hash.h>

#include "crypto.h"
#include "decode.h"

int ceph_crypto_key_encode(struct ceph_crypto_key *key, void **p, void *end)
{
	if (*p + sizeof(u16) + sizeof(key->created) +
	    sizeof(u16) + key->len > end)
		return -ERANGE;
	ceph_encode_16(p, key->type);
	ceph_encode_copy(p, &key->created, sizeof(key->created));
	ceph_encode_16(p, key->len);
	ceph_encode_copy(p, key->key, key->len);
	return 0;
}

int ceph_crypto_key_decode(struct ceph_crypto_key *key, void **p, void *end)
{
	ceph_decode_need(p, end, 2*sizeof(u16) + sizeof(key->created), bad);
	key->type = ceph_decode_16(p);
	ceph_decode_copy(p, &key->created, sizeof(key->created));
	key->len = ceph_decode_16(p);
	ceph_decode_need(p, end, key->len, bad);
	key->key = kmalloc(key->len, GFP_NOFS);
	if (!key->key)
		return -ENOMEM;
	ceph_decode_copy(p, key->key, key->len);
	return 0;

bad:
	dout("failed to decode crypto key\n");
	return -EINVAL;
}

int ceph_crypto_key_unarmor(struct ceph_crypto_key *key, const char *inkey)
{
	int inlen = strlen(inkey);
	int blen = inlen * 3 / 4;
	void *buf, *p;
	int ret;

	dout("crypto_key_unarmor %s\n", inkey);
	buf = kmalloc(blen, GFP_NOFS);
	if (!buf)
		return -ENOMEM;
	blen = ceph_unarmor(buf, inkey, inkey+inlen);
	if (blen < 0) {
		kfree(buf);
		return blen;
	}

	p = buf;
	ret = ceph_crypto_key_decode(key, &p, p + blen);
	kfree(buf);
	if (ret)
		return ret;
	dout("crypto_key_unarmor key %p type %d len %d\n", key,
	     key->type, key->len);
	return 0;
}


#define AES_KEY_SIZE 16

static struct crypto_blkcipher *ceph_crypto_alloc_cipher(void)
{
	return crypto_alloc_blkcipher("cbc(aes)", 0, CRYPTO_ALG_ASYNC);
}

const u8 *aes_iv = "cephsageyudagreg";

int ceph_aes_encrypt(const void *key, int key_len, void *dst, size_t *dst_len,
		     const void *src, size_t src_len)
{
	struct scatterlist sg_in[2], sg_out[1];
	struct crypto_blkcipher *tfm = ceph_crypto_alloc_cipher();
	struct blkcipher_desc desc = { .tfm = tfm, .flags = 0 };
	int ret;
	void *iv;
	int ivsize;
	/* pad to the next 16-byte boundary; always adds 1..16 bytes */
	size_t zero_padding = (0x10 - (src_len & 0x0f));
	char pad[16];

	if (IS_ERR(tfm))
		return PTR_ERR(tfm);

	/* each pad byte holds the pad length */
	memset(pad, zero_padding, zero_padding);

	*dst_len = src_len + zero_padding;

	crypto_blkcipher_setkey((void *)tfm, key, key_len);
	sg_init_table(sg_in, 2);
	sg_set_buf(&sg_in[0], src, src_len);
	sg_set_buf(&sg_in[1], pad, zero_padding);
	sg_init_table(sg_out, 1);
	sg_set_buf(sg_out, dst, *dst_len);
	iv = crypto_blkcipher_crt(tfm)->iv;
	ivsize = crypto_blkcipher_ivsize(tfm);

	memcpy(iv, aes_iv, ivsize);
	/*
	print_hex_dump(KERN_ERR, "enc key: ", DUMP_PREFIX_NONE, 16, 1,
		       key, key_len, 1);
	print_hex_dump(KERN_ERR, "enc src: ", DUMP_PREFIX_NONE, 16, 1,
			src, src_len, 1);
	print_hex_dump(KERN_ERR, "enc pad: ", DUMP_PREFIX_NONE, 16, 1,
			pad, zero_padding, 1);
	*/
	ret = crypto_blkcipher_encrypt(&desc, sg_out, sg_in,
				     src_len + zero_padding);
	crypto_free_blkcipher(tfm);
	if (ret < 0) {
		pr_err("ceph_aes_crypt failed %d\n", ret);
		return ret;	/* report the failure instead of returning 0 */
	}
	/*
	print_hex_dump(KERN_ERR, "enc out: ", DUMP_PREFIX_NONE, 16, 1,
		       dst, *dst_len, 1);
	*/
	return 0;
}

int ceph_aes_encrypt2(const void *key, int key_len, void *dst, size_t *dst_len,
		      const void *src1, size_t src1_len,
		      const void *src2, size_t src2_len)
{
	struct scatterlist sg_in[3], sg_out[1];
	struct crypto_blkcipher *tfm = ceph_crypto_alloc_cipher();
	struct blkcipher_desc desc = { .tfm = tfm, .flags = 0 };
	int ret;
	void *iv;
	int ivsize;
	/* pad the combined input to the next 16-byte boundary */
	size_t zero_padding = (0x10 - ((src1_len + src2_len) & 0x0f));
	char pad[16];

	if (IS_ERR(tfm))
		return PTR_ERR(tfm);

	/* each pad byte holds the pad length */
	memset(pad, zero_padding, zero_padding);

	*dst_len = src1_len + src2_len + zero_padding;

	crypto_blkcipher_setkey((void *)tfm, key, key_len);
	sg_init_table(sg_in, 3);
	sg_set_buf(&sg_in[0], src1, src1_len);
	sg_set_buf(&sg_in[1], src2, src2_len);
	sg_set_buf(&sg_in[2], pad, zero_padding);
	sg_init_table(sg_out, 1);
	sg_set_buf(sg_out, dst, *dst_len);
	iv = crypto_blkcipher_crt(tfm)->iv;
	ivsize = crypto_blkcipher_ivsize(tfm);

	memcpy(iv, aes_iv, ivsize);
	/*
	print_hex_dump(KERN_ERR, "enc  key: ", DUMP_PREFIX_NONE, 16, 1,
		       key, key_len, 1);
	print_hex_dump(KERN_ERR, "enc src1: ", DUMP_PREFIX_NONE, 16, 1,
			src1, src1_len, 1);
	print_hex_dump(KERN_ERR, "enc src2: ", DUMP_PREFIX_NONE, 16, 1,
			src2, src2_len, 1);
	print_hex_dump(KERN_ERR, "enc  pad: ", DUMP_PREFIX_NONE, 16, 1,
			pad, zero_padding, 1);
	*/
	ret = crypto_blkcipher_encrypt(&desc, sg_out, sg_in,
				     src1_len + src2_len + zero_padding);
	crypto_free_blkcipher(tfm);
	if (ret < 0) {
		pr_err("ceph_aes_crypt2 failed %d\n", ret);
		return ret;	/* report the failure instead of returning 0 */
	}
	/*
	print_hex_dump(KERN_ERR, "enc  out: ", DUMP_PREFIX_NONE, 16, 1,
		       dst, *dst_len, 1);
	*/
	return 0;
}

int ceph_aes_decrypt(const void *key, int key_len, void *dst, size_t *dst_len,
		     const void *src, size_t src_len)
{
	struct scatterlist sg_in[1], sg_out[2];
	struct crypto_blkcipher *tfm = ceph_crypto_alloc_cipher();
	struct blkcipher_desc desc = { .tfm = tfm };
	char pad[16];
	void *iv;
	int ivsize;
	int ret;
	int last_byte;

	if (IS_ERR(tfm))
		return PTR_ERR(tfm);

	crypto_blkcipher_setkey((void *)tfm, key, key_len);
	sg_init_table(sg_in, 1);
	sg_init_table(sg_out, 2);
	sg_set_buf(sg_in, src, src_len);
	sg_set_buf(&sg_out[0], dst, *dst_len);
	sg_set_buf(&sg_out[1], pad, sizeof(pad));

	iv = crypto_blkcipher_crt(tfm)->iv;
	ivsize = crypto_blkcipher_ivsize(tfm);

	memcpy(iv, aes_iv, ivsize);

	/*
	print_hex_dump(KERN_ERR, "dec key: ", DUMP_PREFIX_NONE, 16, 1,
		       key, key_len, 1);
	print_hex_dump(KERN_ERR, "dec  in: ", DUMP_PREFIX_NONE, 16, 1,
		       src, src_len, 1);
	*/

	ret = crypto_blkcipher_decrypt(&desc, sg_out, sg_in, src_len);
	crypto_free_blkcipher(tfm);
	if (ret < 0) {
		pr_err("ceph_aes_decrypt failed %d\n", ret);
		return ret;
	}

	if (src_len <= *dst_len)
		last_byte = ((char *)dst)[src_len - 1];
	else
		last_byte = pad[src_len - *dst_len - 1];
	if (last_byte <= 16 && src_len >= last_byte) {
		*dst_len = src_len - last_byte;
	} else {
		pr_err("ceph_aes_decrypt got bad padding %d on src len %d\n",
		       last_byte, (int)src_len);
		return -EPERM;  /* bad padding */
	}
	/*
	print_hex_dump(KERN_ERR, "dec out: ", DUMP_PREFIX_NONE, 16, 1,
		       dst, *dst_len, 1);
	*/
	return 0;
}

int ceph_aes_decrypt2(const void *key, int key_len,
		      void *dst1, size_t *dst1_len,
		      void *dst2, size_t *dst2_len,
		      const void *src, size_t src_len)
{
	struct scatterlist sg_in[1], sg_out[3];
	struct crypto_blkcipher *tfm = ceph_crypto_alloc_cipher();
	struct blkcipher_desc desc = { .tfm = tfm };
	char pad[16];
	void *iv;
	int ivsize;
	int ret;
	int last_byte;

	if (IS_ERR(tfm))
		return PTR_ERR(tfm);

	sg_init_table(sg_in, 1);
	sg_set_buf(sg_in, src, src_len);
	sg_init_table(sg_out, 3);
	sg_set_buf(&sg_out[0], dst1, *dst1_len);
	sg_set_buf(&sg_out[1], dst2, *dst2_len);
	sg_set_buf(&sg_out[2], pad, sizeof(pad));

	crypto_blkcipher_setkey((void *)tfm, key, key_len);
	iv = crypto_blkcipher_crt(tfm)->iv;
	ivsize = crypto_blkcipher_ivsize(tfm);

	memcpy(iv, aes_iv, ivsize);

	/*
	print_hex_dump(KERN_ERR, "dec  key: ", DUMP_PREFIX_NONE, 16, 1,
		       key, key_len, 1);
	print_hex_dump(KERN_ERR, "dec   in: ", DUMP_PREFIX_NONE, 16, 1,
		       src, src_len, 1);
	*/

	ret = crypto_blkcipher_decrypt(&desc, sg_out, sg_in, src_len);
	crypto_free_blkcipher(tfm);
	if (ret < 0) {
		pr_err("ceph_aes_decrypt failed %d\n", ret);
		return ret;
	}

	if (src_len <= *dst1_len)
		last_byte = ((char *)dst1)[src_len - 1];
	else if (src_len <= *dst1_len + *dst2_len)
		last_byte = ((char *)dst2)[src_len - *dst1_len - 1];
	else
		last_byte = pad[src_len - *dst1_len - *dst2_len - 1];
	if (last_byte <= 16 && src_len >= last_byte) {
		src_len -= last_byte;
	} else {
		pr_err("ceph_aes_decrypt got bad padding %d on src len %d\n",
		       last_byte, (int)src_len);
		return -EPERM;  /* bad padding */
	}

	if (src_len < *dst1_len) {
		*dst1_len = src_len;
		*dst2_len = 0;
	} else {
		*dst2_len = src_len - *dst1_len;
	}
	/*
	print_hex_dump(KERN_ERR, "dec  out1: ", DUMP_PREFIX_NONE, 16, 1,
		       dst1, *dst1_len, 1);
	print_hex_dump(KERN_ERR, "dec  out2: ", DUMP_PREFIX_NONE, 16, 1,
		       dst2, *dst2_len, 1);
	*/

	return 0;
}


int ceph_decrypt(struct ceph_crypto_key *secret, void *dst, size_t *dst_len,
		 const void *src, size_t src_len)
{
	switch (secret->type) {
	case CEPH_CRYPTO_NONE:
		if (*dst_len < src_len)
			return -ERANGE;
		memcpy(dst, src, src_len);
		*dst_len = src_len;
		return 0;

	case CEPH_CRYPTO_AES:
		return ceph_aes_decrypt(secret->key, secret->len, dst,
					dst_len, src, src_len);

	default:
		return -EINVAL;
	}
}

int ceph_decrypt2(struct ceph_crypto_key *secret,
		  void *dst1, size_t *dst1_len,
		  void *dst2, size_t *dst2_len,
		  const void *src, size_t src_len)
{
	size_t t;

	switch (secret->type) {
	case CEPH_CRYPTO_NONE:
		if (*dst1_len + *dst2_len < src_len)
			return -ERANGE;
		t = min(*dst1_len, src_len);
		memcpy(dst1, src, t);
		*dst1_len = t;
		src += t;
		src_len -= t;
		if (src_len) {
			t = min(*dst2_len, src_len);
			memcpy(dst2, src, t);
			*dst2_len = t;
		}
		return 0;

	case CEPH_CRYPTO_AES:
		return ceph_aes_decrypt2(secret->key, secret->len,
					 dst1, dst1_len, dst2, dst2_len,
					 src, src_len);

	default:
		return -EINVAL;
	}
}

int ceph_encrypt(struct ceph_crypto_key *secret, void *dst, size_t *dst_len,
		 const void *src, size_t src_len)
{
	switch (secret->type) {
	case CEPH_CRYPTO_NONE:
		if (*dst_len < src_len)
			return -ERANGE;
		memcpy(dst, src, src_len);
		*dst_len = src_len;
		return 0;

	case CEPH_CRYPTO_AES:
		return ceph_aes_encrypt(secret->key, secret->len, dst,
					dst_len, src, src_len);

	default:
		return -EINVAL;
	}
}

int ceph_encrypt2(struct ceph_crypto_key *secret, void *dst, size_t *dst_len,
		  const void *src1, size_t src1_len,
		  const void *src2, size_t src2_len)
{
	switch (secret->type) {
	case CEPH_CRYPTO_NONE:
		if (*dst_len < src1_len + src2_len)
			return -ERANGE;
		memcpy(dst, src1, src1_len);
		memcpy(dst + src1_len, src2, src2_len);
		*dst_len = src1_len + src2_len;
		return 0;

	case CEPH_CRYPTO_AES:
		return ceph_aes_encrypt2(secret->key, secret->len, dst, dst_len,
					 src1, src1_len, src2, src2_len);

	default:
		return -EINVAL;
	}
}
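
The AES helpers in fs/ceph/crypto.c pad the plaintext up to the next 16-byte boundary, fill every pad byte with the pad length, and on decrypt read the last byte to recover the payload length. A rough user-space sketch of just that arithmetic follows, with no cipher involved. The names pad_len, pad_append and pad_strip are hypothetical, and rejecting a pad byte of zero is an extra check assumed here that the kernel code does not make.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK 16	/* AES block size in bytes */

/* Pad bytes needed for src_len bytes of input: always 1..16. */
static size_t pad_len(size_t src_len)
{
	return BLOCK - (src_len & (BLOCK - 1));
}

/*
 * Append the padding in place and return the padded length.
 * buf must have room for len + BLOCK bytes.
 */
static size_t pad_append(unsigned char *buf, size_t len)
{
	size_t n = pad_len(len);

	memset(buf + len, (int)n, n);	/* each pad byte holds n */
	return len + n;
}

/*
 * Return the payload length of a padded buffer, or -1 if the last
 * byte cannot be a valid pad length. The "last <= 16 and
 * last <= len" test mirrors ceph_aes_decrypt; the "last == 0"
 * rejection is an assumption added by this sketch.
 */
static long pad_strip(const unsigned char *buf, size_t len)
{
	unsigned int last;

	if (len == 0)
		return -1;
	last = buf[len - 1];
	if (last == 0 || last > BLOCK || last > len)
		return -1;
	return (long)(len - last);
}
```

An input that is already a multiple of 16 bytes still gains a full 16-byte pad block, which is what lets the decrypt side always trust the final byte as a length.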
48
fs/ceph/crypto.h
Normal file
@ -0,0 +1,48 @@
#ifndef _FS_CEPH_CRYPTO_H
#define _FS_CEPH_CRYPTO_H

#include "types.h"
#include "buffer.h"

/*
 * cryptographic secret
 */
struct ceph_crypto_key {
	int type;
	struct ceph_timespec created;
	int len;
	void *key;
};

static inline void ceph_crypto_key_destroy(struct ceph_crypto_key *key)
{
	kfree(key->key);
}

extern int ceph_crypto_key_encode(struct ceph_crypto_key *key,
				  void **p, void *end);
extern int ceph_crypto_key_decode(struct ceph_crypto_key *key,
				  void **p, void *end);
extern int ceph_crypto_key_unarmor(struct ceph_crypto_key *key, const char *in);

/* crypto.c */
extern int ceph_decrypt(struct ceph_crypto_key *secret,
			void *dst, size_t *dst_len,
			const void *src, size_t src_len);
extern int ceph_encrypt(struct ceph_crypto_key *secret,
			void *dst, size_t *dst_len,
			const void *src, size_t src_len);
extern int ceph_decrypt2(struct ceph_crypto_key *secret,
			void *dst1, size_t *dst1_len,
			void *dst2, size_t *dst2_len,
			const void *src, size_t src_len);
extern int ceph_encrypt2(struct ceph_crypto_key *secret,
			 void *dst, size_t *dst_len,
			 const void *src1, size_t src1_len,
			 const void *src2, size_t src2_len);

/* armor.c */
extern int ceph_armor(char *dst, const void *src, const void *end);
extern int ceph_unarmor(void *dst, const char *src, const char *end);

#endif
483
fs/ceph/debugfs.c
Normal file
@ -0,0 +1,483 @@
#include "ceph_debug.h"

#include <linux/device.h>
#include <linux/module.h>
#include <linux/ctype.h>
#include <linux/debugfs.h>
#include <linux/seq_file.h>

#include "super.h"
#include "mds_client.h"
#include "mon_client.h"
#include "auth.h"

#ifdef CONFIG_DEBUG_FS

/*
 * Implement /sys/kernel/debug/ceph fun
 *
 * /sys/kernel/debug/ceph/client*  - an instance of the ceph client
 *      .../osdmap      - current osdmap
 *      .../mdsmap      - current mdsmap
 *      .../monmap      - current monmap
 *      .../osdc        - active osd requests
 *      .../mdsc        - active mds requests
 *      .../monc        - mon client state
 *      .../dentry_lru  - dump contents of dentry lru
 *      .../caps        - expose cap (reservation) stats
 *      .../bdi         - symlink to ../../bdi/something
 */

static struct dentry *ceph_debugfs_dir;

static int monmap_show(struct seq_file *s, void *p)
{
	int i;
	struct ceph_client *client = s->private;

	if (client->monc.monmap == NULL)
		return 0;

	seq_printf(s, "epoch %d\n", client->monc.monmap->epoch);
	for (i = 0; i < client->monc.monmap->num_mon; i++) {
		struct ceph_entity_inst *inst =
			&client->monc.monmap->mon_inst[i];

		seq_printf(s, "\t%s%lld\t%s\n",
			   ENTITY_NAME(inst->name),
			   pr_addr(&inst->addr.in_addr));
	}
	return 0;
}

static int mdsmap_show(struct seq_file *s, void *p)
{
	int i;
	struct ceph_client *client = s->private;

	if (client->mdsc.mdsmap == NULL)
		return 0;
	seq_printf(s, "epoch %d\n", client->mdsc.mdsmap->m_epoch);
	seq_printf(s, "root %d\n", client->mdsc.mdsmap->m_root);
	seq_printf(s, "session_timeout %d\n",
		       client->mdsc.mdsmap->m_session_timeout);
	seq_printf(s, "session_autoclose %d\n",
		       client->mdsc.mdsmap->m_session_autoclose);
	for (i = 0; i < client->mdsc.mdsmap->m_max_mds; i++) {
		struct ceph_entity_addr *addr =
			&client->mdsc.mdsmap->m_info[i].addr;
		int state = client->mdsc.mdsmap->m_info[i].state;

		seq_printf(s, "\tmds%d\t%s\t(%s)\n", i, pr_addr(&addr->in_addr),
			       ceph_mds_state_name(state));
	}
	return 0;
}

static int osdmap_show(struct seq_file *s, void *p)
{
	int i;
	struct ceph_client *client = s->private;
	struct rb_node *n;

	if (client->osdc.osdmap == NULL)
		return 0;
	seq_printf(s, "epoch %d\n", client->osdc.osdmap->epoch);
	seq_printf(s, "flags%s%s\n",
		   (client->osdc.osdmap->flags & CEPH_OSDMAP_NEARFULL) ?
		   " NEARFULL" : "",
		   (client->osdc.osdmap->flags & CEPH_OSDMAP_FULL) ?
		   " FULL" : "");
	for (n = rb_first(&client->osdc.osdmap->pg_pools); n; n = rb_next(n)) {
		struct ceph_pg_pool_info *pool =
			rb_entry(n, struct ceph_pg_pool_info, node);
		seq_printf(s, "pg_pool %d pg_num %d / %d, lpg_num %d / %d\n",
			   pool->id, pool->v.pg_num, pool->pg_num_mask,
			   pool->v.lpg_num, pool->lpg_num_mask);
	}
	for (i = 0; i < client->osdc.osdmap->max_osd; i++) {
		struct ceph_entity_addr *addr =
			&client->osdc.osdmap->osd_addr[i];
		int state = client->osdc.osdmap->osd_state[i];
		char sb[64];

		seq_printf(s, "\tosd%d\t%s\t%3d%%\t(%s)\n",
			   i, pr_addr(&addr->in_addr),
			   ((client->osdc.osdmap->osd_weight[i]*100) >> 16),
			   ceph_osdmap_state_str(sb, sizeof(sb), state));
	}
	return 0;
}

static int monc_show(struct seq_file *s, void *p)
{
	struct ceph_client *client = s->private;
	struct ceph_mon_statfs_request *req;
	struct ceph_mon_client *monc = &client->monc;
	struct rb_node *rp;

	mutex_lock(&monc->mutex);

	if (monc->have_mdsmap)
		seq_printf(s, "have mdsmap %u\n", (unsigned)monc->have_mdsmap);
	if (monc->have_osdmap)
		seq_printf(s, "have osdmap %u\n", (unsigned)monc->have_osdmap);
	if (monc->want_next_osdmap)
		seq_printf(s, "want next osdmap\n");

	for (rp = rb_first(&monc->statfs_request_tree); rp; rp = rb_next(rp)) {
		req = rb_entry(rp, struct ceph_mon_statfs_request, node);
		seq_printf(s, "%lld statfs\n", req->tid);
	}

	mutex_unlock(&monc->mutex);
	return 0;
}

static int mdsc_show(struct seq_file *s, void *p)
{
	struct ceph_client *client = s->private;
	struct ceph_mds_client *mdsc = &client->mdsc;
	struct ceph_mds_request *req;
	struct rb_node *rp;
	int pathlen;
	u64 pathbase;
	char *path;

	mutex_lock(&mdsc->mutex);
	for (rp = rb_first(&mdsc->request_tree); rp; rp = rb_next(rp)) {
		req = rb_entry(rp, struct ceph_mds_request, r_node);

		if (req->r_request)
			seq_printf(s, "%lld\tmds%d\t", req->r_tid, req->r_mds);
		else
			seq_printf(s, "%lld\t(no request)\t", req->r_tid);

		seq_printf(s, "%s", ceph_mds_op_name(req->r_op));

		if (req->r_got_unsafe)
			seq_printf(s, "\t(unsafe)");
		else
			seq_printf(s, "\t");

		if (req->r_inode) {
			seq_printf(s, " #%llx", ceph_ino(req->r_inode));
		} else if (req->r_dentry) {
			path = ceph_mdsc_build_path(req->r_dentry, &pathlen,
						    &pathbase, 0);
			spin_lock(&req->r_dentry->d_lock);
			seq_printf(s, " #%llx/%.*s (%s)",
				   ceph_ino(req->r_dentry->d_parent->d_inode),
				   req->r_dentry->d_name.len,
				   req->r_dentry->d_name.name,
				   path ? path : "");
			spin_unlock(&req->r_dentry->d_lock);
			kfree(path);
		} else if (req->r_path1) {
			seq_printf(s, " #%llx/%s", req->r_ino1.ino,
				   req->r_path1);
		}

		if (req->r_old_dentry) {
			path = ceph_mdsc_build_path(req->r_old_dentry, &pathlen,
						    &pathbase, 0);
			spin_lock(&req->r_old_dentry->d_lock);
			seq_printf(s, " #%llx/%.*s (%s)",
			   ceph_ino(req->r_old_dentry->d_parent->d_inode),
				   req->r_old_dentry->d_name.len,
				   req->r_old_dentry->d_name.name,
				   path ? path : "");
			spin_unlock(&req->r_old_dentry->d_lock);
			kfree(path);
		} else if (req->r_path2) {
			if (req->r_ino2.ino)
				seq_printf(s, " #%llx/%s", req->r_ino2.ino,
					   req->r_path2);
			else
				seq_printf(s, " %s", req->r_path2);
		}

		seq_printf(s, "\n");
	}
	mutex_unlock(&mdsc->mutex);

	return 0;
}

static int osdc_show(struct seq_file *s, void *pp)
{
	struct ceph_client *client = s->private;
	struct ceph_osd_client *osdc = &client->osdc;
	struct rb_node *p;

	mutex_lock(&osdc->request_mutex);
	for (p = rb_first(&osdc->requests); p; p = rb_next(p)) {
		struct ceph_osd_request *req;
		struct ceph_osd_request_head *head;
		struct ceph_osd_op *op;
		int num_ops;
		int opcode, olen;
		int i;

		req = rb_entry(p, struct ceph_osd_request, r_node);

		seq_printf(s, "%lld\tosd%d\t%d.%x\t", req->r_tid,
			   req->r_osd ? req->r_osd->o_osd : -1,
			   le32_to_cpu(req->r_pgid.pool),
			   le16_to_cpu(req->r_pgid.ps));

		head = req->r_request->front.iov_base;
		op = (void *)(head + 1);

		num_ops = le16_to_cpu(head->num_ops);
		olen = le32_to_cpu(head->object_len);
		seq_printf(s, "%.*s", olen,
			   (const char *)(head->ops + num_ops));

		if (req->r_reassert_version.epoch)
			seq_printf(s, "\t%u'%llu",
			   (unsigned)le32_to_cpu(req->r_reassert_version.epoch),
			   le64_to_cpu(req->r_reassert_version.version));
		else
			seq_printf(s, "\t");

		for (i = 0; i < num_ops; i++) {
			opcode = le16_to_cpu(op->op);
			seq_printf(s, "\t%s", ceph_osd_op_name(opcode));
			op++;
		}

		seq_printf(s, "\n");
	}
	mutex_unlock(&osdc->request_mutex);
	return 0;
}

static int caps_show(struct seq_file *s, void *p)
{
	/* the client lives in s->private; p is only the iterator cookie */
	struct ceph_client *client = s->private;
	int total, avail, used, reserved, min;

	ceph_reservation_status(client, &total, &avail, &used, &reserved, &min);
	seq_printf(s, "total\t\t%d\n"
		      "avail\t\t%d\n"
		      "used\t\t%d\n"
		      "reserved\t%d\n"
		      "min\t%d\n",
		   total, avail, used, reserved, min);
	return 0;
}

static int dentry_lru_show(struct seq_file *s, void *ptr)
{
	struct ceph_client *client = s->private;
	struct ceph_mds_client *mdsc = &client->mdsc;
	struct ceph_dentry_info *di;

	spin_lock(&mdsc->dentry_lru_lock);
	list_for_each_entry(di, &mdsc->dentry_lru, lru) {
		struct dentry *dentry = di->dentry;
		seq_printf(s, "%p %p\t%.*s\n",
			   di, dentry, dentry->d_name.len, dentry->d_name.name);
	}
	spin_unlock(&mdsc->dentry_lru_lock);

	return 0;
}

/* bail out if single_open() fails, before touching file->private_data */
#define DEFINE_SHOW_FUNC(name)						\
static int name##_open(struct inode *inode, struct file *file)		\
{									\
	struct seq_file *sf;						\
	int ret;							\
									\
	ret = single_open(file, name, NULL);				\
	if (ret)							\
		return ret;						\
	sf = file->private_data;					\
	sf->private = inode->i_private;					\
	return ret;							\
}									\
									\
static const struct file_operations name##_fops = {			\
	.open		= name##_open,					\
	.read		= seq_read,					\
	.llseek		= seq_lseek,					\
	.release	= single_release,				\
};

DEFINE_SHOW_FUNC(monmap_show)
DEFINE_SHOW_FUNC(mdsmap_show)
DEFINE_SHOW_FUNC(osdmap_show)
DEFINE_SHOW_FUNC(monc_show)
DEFINE_SHOW_FUNC(mdsc_show)
DEFINE_SHOW_FUNC(osdc_show)
DEFINE_SHOW_FUNC(dentry_lru_show)
DEFINE_SHOW_FUNC(caps_show)

static int congestion_kb_set(void *data, u64 val)
{
	struct ceph_client *client = (struct ceph_client *)data;

	if (client)
		client->mount_args->congestion_kb = (int)val;

	return 0;
}

static int congestion_kb_get(void *data, u64 *val)
{
	struct ceph_client *client = (struct ceph_client *)data;

	if (client)
		*val = (u64)client->mount_args->congestion_kb;

	return 0;
}


DEFINE_SIMPLE_ATTRIBUTE(congestion_kb_fops, congestion_kb_get,
			congestion_kb_set, "%llu\n");

int __init ceph_debugfs_init(void)
{
	ceph_debugfs_dir = debugfs_create_dir("ceph", NULL);
	if (!ceph_debugfs_dir)
		return -ENOMEM;
	return 0;
}

void ceph_debugfs_cleanup(void)
{
	debugfs_remove(ceph_debugfs_dir);
}

int ceph_debugfs_client_init(struct ceph_client *client)
{
	int ret = -ENOMEM;	/* returned if any entry below fails */
	char name[80];

	snprintf(name, sizeof(name), FSID_FORMAT ".client%lld",
		 PR_FSID(&client->fsid), client->monc.auth->global_id);

	client->debugfs_dir = debugfs_create_dir(name, ceph_debugfs_dir);
	if (!client->debugfs_dir)
		goto out;

	client->monc.debugfs_file = debugfs_create_file("monc",
						      0600,
						      client->debugfs_dir,
						      client,
						      &monc_show_fops);
	if (!client->monc.debugfs_file)
		goto out;

	client->mdsc.debugfs_file = debugfs_create_file("mdsc",
						      0600,
						      client->debugfs_dir,
						      client,
						      &mdsc_show_fops);
	if (!client->mdsc.debugfs_file)
		goto out;

	client->osdc.debugfs_file = debugfs_create_file("osdc",
						      0600,
						      client->debugfs_dir,
						      client,
						      &osdc_show_fops);
	if (!client->osdc.debugfs_file)
		goto out;

	client->debugfs_monmap = debugfs_create_file("monmap",
					0600,
					client->debugfs_dir,
					client,
					&monmap_show_fops);
	if (!client->debugfs_monmap)
		goto out;

	client->debugfs_mdsmap = debugfs_create_file("mdsmap",
					0600,
					client->debugfs_dir,
					client,
					&mdsmap_show_fops);
	if (!client->debugfs_mdsmap)
		goto out;

	client->debugfs_osdmap = debugfs_create_file("osdmap",
					0600,
					client->debugfs_dir,
					client,
					&osdmap_show_fops);
	if (!client->debugfs_osdmap)
		goto out;

	client->debugfs_dentry_lru = debugfs_create_file("dentry_lru",
					0600,
					client->debugfs_dir,
					client,
					&dentry_lru_show_fops);
	if (!client->debugfs_dentry_lru)
		goto out;

	client->debugfs_caps = debugfs_create_file("caps",
						   0400,
						   client->debugfs_dir,
						   client,
						   &caps_show_fops);
	if (!client->debugfs_caps)
		goto out;

	client->debugfs_congestion_kb = debugfs_create_file("writeback_congestion_kb",
						   0600,
						   client->debugfs_dir,
						   client,
						   &congestion_kb_fops);
	if (!client->debugfs_congestion_kb)
		goto out;

	/* bounded write: name is reused for the bdi symlink target */
	snprintf(name, sizeof(name), "../../bdi/%s",
		 dev_name(client->sb->s_bdi->dev));
	client->debugfs_bdi = debugfs_create_symlink("bdi", client->debugfs_dir,
						     name);

	return 0;

out:
	ceph_debugfs_client_cleanup(client);
	return ret;
}

void ceph_debugfs_client_cleanup(struct ceph_client *client)
{
	debugfs_remove(client->debugfs_bdi);
	debugfs_remove(client->debugfs_caps);
	debugfs_remove(client->debugfs_dentry_lru);
	debugfs_remove(client->debugfs_osdmap);
	debugfs_remove(client->debugfs_mdsmap);
	debugfs_remove(client->debugfs_monmap);
	debugfs_remove(client->osdc.debugfs_file);
	debugfs_remove(client->mdsc.debugfs_file);
	debugfs_remove(client->monc.debugfs_file);
	debugfs_remove(client->debugfs_congestion_kb);
	debugfs_remove(client->debugfs_dir);
}

#else  /* CONFIG_DEBUG_FS */

int __init ceph_debugfs_init(void)
{
	return 0;
}

void ceph_debugfs_cleanup(void)
{
}

int ceph_debugfs_client_init(struct ceph_client *client)
{
	return 0;
}

void ceph_debugfs_client_cleanup(struct ceph_client *client)
{
}

#endif  /* CONFIG_DEBUG_FS */
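
osdmap_show in fs/ceph/debugfs.c prints each OSD weight as a percentage using `(osd_weight[i]*100) >> 16`. The shift by 16 suggests the weight is a 16.16 fixed-point value in which 0x10000 corresponds to 100%; that reading is an assumption drawn from the expression itself. A small user-space sketch of the conversion, with the hypothetical name weight_percent:

```c
#include <stdint.h>

/*
 * Hypothetical helper: convert a weight stored as 16.16 fixed point
 * (0x10000 == 1.0, assumed) to a whole-number percentage, rounding
 * down. The multiply is done in 64 bits so that large weights cannot
 * overflow; the kernel expression multiplies in the weight's own
 * 32-bit type.
 */
static unsigned int weight_percent(uint32_t weight)
{
	return (unsigned int)(((uint64_t)weight * 100) >> 16);
}
```

Under that assumption a weight of 0x8000 is reported as 50 and a weight of 0x10000 as 100.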
194
fs/ceph/decode.h
Normal file
@ -0,0 +1,194 @@
#ifndef __CEPH_DECODE_H
#define __CEPH_DECODE_H

#include <asm/unaligned.h>
#include <linux/time.h>

#include "types.h"

/*
 * in all cases,
 *   void **p     pointer to position pointer
 *   void *end    pointer to end of buffer (last byte + 1)
 */

static inline u64 ceph_decode_64(void **p)
{
	u64 v = get_unaligned_le64(*p);
	*p += sizeof(u64);
	return v;
}
static inline u32 ceph_decode_32(void **p)
{
	u32 v = get_unaligned_le32(*p);
	*p += sizeof(u32);
	return v;
}
static inline u16 ceph_decode_16(void **p)
{
	u16 v = get_unaligned_le16(*p);
	*p += sizeof(u16);
	return v;
}
static inline u8 ceph_decode_8(void **p)
{
	u8 v = *(u8 *)*p;
	(*p)++;
	return v;
}
static inline void ceph_decode_copy(void **p, void *pv, size_t n)
{
	memcpy(pv, *p, n);
	*p += n;
}

/*
 * bounds check input.
 */
#define ceph_decode_need(p, end, n, bad)		\
	do {						\
		if (unlikely(*(p) + (n) > (end)))	\
			goto bad;			\
	} while (0)

#define ceph_decode_64_safe(p, end, v, bad)			\
	do {							\
		ceph_decode_need(p, end, sizeof(u64), bad);	\
		v = ceph_decode_64(p);				\
	} while (0)
#define ceph_decode_32_safe(p, end, v, bad)			\
	do {							\
		ceph_decode_need(p, end, sizeof(u32), bad);	\
		v = ceph_decode_32(p);				\
	} while (0)
#define ceph_decode_16_safe(p, end, v, bad)			\
	do {							\
		ceph_decode_need(p, end, sizeof(u16), bad);	\
		v = ceph_decode_16(p);				\
	} while (0)
#define ceph_decode_8_safe(p, end, v, bad)			\
	do {							\
		ceph_decode_need(p, end, sizeof(u8), bad);	\
		v = ceph_decode_8(p);				\
	} while (0)

#define ceph_decode_copy_safe(p, end, pv, n, bad)		\
	do {							\
		ceph_decode_need(p, end, n, bad);		\
		ceph_decode_copy(p, pv, n);			\
	} while (0)
|
||||
|
||||
/*
|
||||
* struct ceph_timespec <-> struct timespec
|
||||
*/
|
||||
static inline void ceph_decode_timespec(struct timespec *ts,
|
||||
const struct ceph_timespec *tv)
|
||||
{
|
||||
ts->tv_sec = le32_to_cpu(tv->tv_sec);
|
||||
ts->tv_nsec = le32_to_cpu(tv->tv_nsec);
|
||||
}
|
||||
static inline void ceph_encode_timespec(struct ceph_timespec *tv,
|
||||
const struct timespec *ts)
|
||||
{
|
||||
tv->tv_sec = cpu_to_le32(ts->tv_sec);
|
||||
tv->tv_nsec = cpu_to_le32(ts->tv_nsec);
|
||||
}
|
||||
|
||||
/*
|
||||
* sockaddr_storage <-> ceph_sockaddr
|
||||
*/
|
||||
static inline void ceph_encode_addr(struct ceph_entity_addr *a)
|
||||
{
|
||||
a->in_addr.ss_family = htons(a->in_addr.ss_family);
|
||||
}
|
||||
static inline void ceph_decode_addr(struct ceph_entity_addr *a)
|
||||
{
|
||||
a->in_addr.ss_family = ntohs(a->in_addr.ss_family);
|
||||
WARN_ON(a->in_addr.ss_family == 512);
|
||||
}
|
||||
|
||||
/*
|
||||
* encoders
|
||||
*/
|
||||
static inline void ceph_encode_64(void **p, u64 v)
|
||||
{
|
||||
put_unaligned_le64(v, (__le64 *)*p);
|
||||
*p += sizeof(u64);
|
||||
}
|
||||
static inline void ceph_encode_32(void **p, u32 v)
|
||||
{
|
||||
put_unaligned_le32(v, (__le32 *)*p);
|
||||
*p += sizeof(u32);
|
||||
}
|
||||
static inline void ceph_encode_16(void **p, u16 v)
|
||||
{
|
||||
put_unaligned_le16(v, (__le16 *)*p);
|
||||
*p += sizeof(u16);
|
||||
}
|
||||
static inline void ceph_encode_8(void **p, u8 v)
|
||||
{
|
||||
*(u8 *)*p = v;
|
||||
(*p)++;
|
||||
}
|
||||
static inline void ceph_encode_copy(void **p, const void *s, int len)
|
||||
{
|
||||
memcpy(*p, s, len);
|
||||
*p += len;
|
||||
}
|
||||
|
||||
/*
|
||||
* filepath, string encoders
|
||||
*/
|
||||
static inline void ceph_encode_filepath(void **p, void *end,
|
||||
u64 ino, const char *path)
|
||||
{
|
||||
u32 len = path ? strlen(path) : 0;
|
||||
BUG_ON(*p + sizeof(ino) + sizeof(len) + len > end);
|
||||
ceph_encode_8(p, 1);
|
||||
ceph_encode_64(p, ino);
|
||||
ceph_encode_32(p, len);
|
||||
if (len)
|
||||
memcpy(*p, path, len);
|
||||
*p += len;
|
||||
}
|
||||
|
||||
static inline void ceph_encode_string(void **p, void *end,
|
||||
const char *s, u32 len)
|
||||
{
|
||||
BUG_ON(*p + sizeof(len) + len > end);
|
||||
ceph_encode_32(p, len);
|
||||
if (len)
|
||||
memcpy(*p, s, len);
|
||||
*p += len;
|
||||
}
|
||||
|
||||
#define ceph_encode_need(p, end, n, bad) \
|
||||
do { \
|
||||
if (unlikely(*(p) + (n) > (end))) \
|
||||
goto bad; \
|
||||
} while (0)
|
||||
|
||||
#define ceph_encode_64_safe(p, end, v, bad) \
|
||||
do { \
|
||||
ceph_encode_need(p, end, sizeof(u64), bad); \
|
||||
ceph_encode_64(p, v); \
|
||||
} while (0)
|
||||
#define ceph_encode_32_safe(p, end, v, bad) \
|
||||
do { \
|
||||
ceph_encode_need(p, end, sizeof(u32), bad); \
|
||||
ceph_encode_32(p, v); \
|
||||
} while (0)
|
||||
#define ceph_encode_16_safe(p, end, v, bad) \
|
||||
do { \
|
||||
ceph_encode_need(p, end, sizeof(u16), bad); \
|
||||
ceph_encode_16(p, v); \
|
||||
} while (0)
|
||||
|
||||
#define ceph_encode_copy_safe(p, end, pv, n, bad) \
|
||||
do { \
|
||||
ceph_encode_need(p, end, n, bad); \
|
||||
ceph_encode_copy(p, pv, n); \
|
||||
} while (0)
|
||||
|
||||
|
||||
#endif
|
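The decode helpers in fs/ceph/decode.h above pair a cursor-advancing read with a bounds check (`ceph_decode_need`) that bails out before reading past `end`. A minimal userspace sketch of that pattern follows. The names `struct cursor`, `cursor_has`, `decode_le32` and `decode_le16` are hypothetical and not part of the kernel code, and the sketch returns an error code where the kernel macros jump to a `bad` label.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cursor; the kernel helpers pass void **p and void *end. */
struct cursor {
	const uint8_t *p;   /* current read position */
	const uint8_t *end; /* one past the last valid byte */
};

/* Return 1 if at least n bytes remain between p and end, else 0. */
static int cursor_has(const struct cursor *c, size_t n)
{
	return (size_t)(c->end - c->p) >= n;
}

/*
 * Decode a little-endian u32 one byte at a time, so neither alignment
 * nor host byte order matters. Returns 0 on success, or -1 (cursor left
 * untouched) if fewer than 4 bytes remain.
 */
static int decode_le32(struct cursor *c, uint32_t *v)
{
	if (!cursor_has(c, 4))
		return -1;
	*v = (uint32_t)c->p[0] |
	     ((uint32_t)c->p[1] << 8) |
	     ((uint32_t)c->p[2] << 16) |
	     ((uint32_t)c->p[3] << 24);
	c->p += 4;
	return 0;
}

/* Same idea for a little-endian u16. */
static int decode_le16(struct cursor *c, uint16_t *v)
{
	if (!cursor_has(c, 2))
		return -1;
	*v = (uint16_t)(c->p[0] | (c->p[1] << 8));
	c->p += 2;
	return 0;
}
```

The sketch compares the remaining length against `n` instead of computing `p + n`, which sidesteps forming a pointer past the buffer; that is a choice made here, not something the header does.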
1220
fs/ceph/dir.c
Normal file
File diff suppressed because it is too large
223
fs/ceph/export.c
Normal file
@ -0,0 +1,223 @@
|
||||
#include "ceph_debug.h"
|
||||
|
||||
#include <linux/exportfs.h>
|
||||
#include <asm/unaligned.h>
|
||||
|
||||
#include "super.h"
|
||||
|
||||
/*
|
||||
* NFS export support
|
||||
*
|
||||
* NFS re-export of a ceph mount is, at present, only semireliable.
|
||||
* The basic issue is that the Ceph architecture doesn't lend itself
|
||||
* well to generating filehandles that will remain valid forever.
|
||||
*
|
||||
* So, we do our best. If you're lucky, your inode will be in the
|
||||
* client's cache. If it's not, and you have a connectable fh, then
|
||||
* the MDS server may be able to find it for you. Otherwise, you get
|
||||
* ESTALE.
|
||||
*
|
||||
* There are ways to make this more reliable, but in the non-connectable fh
|
||||
* case, we won't ever work perfectly, and in the connectable case,
|
||||
* some changes are needed on the MDS side to work better.
|
||||
*/
|
||||
|
||||
/*
|
||||
* Basic fh
|
||||
*/
|
||||
struct ceph_nfs_fh {
|
||||
u64 ino;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
/*
|
||||
* Larger 'connectable' fh that includes parent ino and name hash.
|
||||
* Use this whenever possible, as it works more reliably.
|
||||
*/
|
||||
struct ceph_nfs_confh {
|
||||
u64 ino, parent_ino;
|
||||
u32 parent_name_hash;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
static int ceph_encode_fh(struct dentry *dentry, u32 *rawfh, int *max_len,
|
||||
int connectable)
|
||||
{
|
||||
struct ceph_nfs_fh *fh = (void *)rawfh;
|
||||
struct ceph_nfs_confh *cfh = (void *)rawfh;
|
||||
struct dentry *parent = dentry->d_parent;
|
||||
struct inode *inode = dentry->d_inode;
|
||||
int type;
|
||||
|
||||
/* don't re-export snaps */
|
||||
if (ceph_snap(inode) != CEPH_NOSNAP)
|
||||
return -EINVAL;
|
||||
|
||||
if (*max_len >= sizeof(*cfh)) {
|
||||
dout("encode_fh %p connectable\n", dentry);
|
||||
cfh->ino = ceph_ino(dentry->d_inode);
|
||||
cfh->parent_ino = ceph_ino(parent->d_inode);
|
||||
cfh->parent_name_hash = parent->d_name.hash;
|
||||
*max_len = sizeof(*cfh);
|
||||
type = 2;
|
||||
} else if (*max_len > sizeof(*fh)) {
|
||||
if (connectable)
|
||||
return -ENOSPC;
|
||||
dout("encode_fh %p\n", dentry);
|
||||
fh->ino = ceph_ino(dentry->d_inode);
|
||||
*max_len = sizeof(*fh);
|
||||
type = 1;
|
||||
} else {
|
||||
return -ENOSPC;
|
||||
}
|
||||
return type;
|
||||
}
|
||||
|
||||
/*
|
||||
* convert regular fh to dentry
|
||||
*
|
||||
* FIXME: we should try harder by querying the mds for the ino.
|
||||
*/
|
||||
static struct dentry *__fh_to_dentry(struct super_block *sb,
|
||||
struct ceph_nfs_fh *fh)
|
||||
{
|
||||
struct inode *inode;
|
||||
struct dentry *dentry;
|
||||
struct ceph_vino vino;
|
||||
int err;
|
||||
|
||||
dout("__fh_to_dentry %llx\n", fh->ino);
|
||||
vino.ino = fh->ino;
|
||||
vino.snap = CEPH_NOSNAP;
|
||||
inode = ceph_find_inode(sb, vino);
|
||||
if (!inode)
|
||||
return ERR_PTR(-ESTALE);
|
||||
|
||||
dentry = d_obtain_alias(inode);
|
||||
if (!dentry) {
|
||||
pr_err("fh_to_dentry %llx -- inode %p but ENOMEM\n",
|
||||
fh->ino, inode);
|
||||
iput(inode);
|
||||
return ERR_PTR(-ENOMEM);
|
||||
}
|
||||
err = ceph_init_dentry(dentry);
|
||||
|
||||
if (err < 0) {
|
||||
iput(inode);
|
||||
return ERR_PTR(err);
|
||||
}
|
||||
dout("__fh_to_dentry %llx %p dentry %p\n", fh->ino, inode, dentry);
|
||||
return dentry;
|
||||
}
|
||||
|
||||
/*
|
||||
* convert connectable fh to dentry
|
||||
*/
|
||||
static struct dentry *__cfh_to_dentry(struct super_block *sb,
|
||||
struct ceph_nfs_confh *cfh)
|
||||
{
|
||||
struct ceph_mds_client *mdsc = &ceph_client(sb)->mdsc;
|
||||
struct inode *inode;
|
||||
struct dentry *dentry;
|
||||
struct ceph_vino vino;
|
||||
int err;
|
||||
|
||||
dout("__cfh_to_dentry %llx (%llx/%x)\n",
|
||||
cfh->ino, cfh->parent_ino, cfh->parent_name_hash);
|
||||
|
||||
vino.ino = cfh->ino;
|
||||
vino.snap = CEPH_NOSNAP;
|
||||
inode = ceph_find_inode(sb, vino);
|
||||
if (!inode) {
|
||||
struct ceph_mds_request *req;
|
||||
|
||||
req = ceph_mdsc_create_request(mdsc, CEPH_MDS_OP_LOOKUPHASH,
|
||||
USE_ANY_MDS);
|
||||
if (IS_ERR(req))
|
||||
return ERR_PTR(PTR_ERR(req));
|
||||
|
||||
req->r_ino1 = vino;
|
||||
req->r_ino2.ino = cfh->parent_ino;
|
||||
req->r_ino2.snap = CEPH_NOSNAP;
|
||||
req->r_path2 = kmalloc(16, GFP_NOFS);
|
||||
snprintf(req->r_path2, 16, "%d", cfh->parent_name_hash);
|
||||
req->r_num_caps = 1;
|
||||
err = ceph_mdsc_do_request(mdsc, NULL, req);
|
||||
ceph_mdsc_put_request(req);
|
||||
inode = ceph_find_inode(sb, vino);
|
||||
if (!inode)
|
||||
return ERR_PTR(err ? err : -ESTALE);
|
||||
}
|
||||
|
||||
dentry = d_obtain_alias(inode);
|
||||
if (!dentry) {
|
||||
pr_err("cfh_to_dentry %llx -- inode %p but ENOMEM\n",
|
||||
cfh->ino, inode);
|
||||
iput(inode);
|
||||
return ERR_PTR(-ENOMEM);
|
||||
}
|
||||
err = ceph_init_dentry(dentry);
|
||||
if (err < 0) {
|
||||
iput(inode);
|
||||
return ERR_PTR(err);
|
||||
}
|
||||
dout("__cfh_to_dentry %llx %p dentry %p\n", cfh->ino, inode, dentry);
|
||||
return dentry;
|
||||
}
|
||||
|
||||
static struct dentry *ceph_fh_to_dentry(struct super_block *sb, struct fid *fid,
|
||||
int fh_len, int fh_type)
|
||||
{
|
||||
if (fh_type == 1)
|
||||
return __fh_to_dentry(sb, (struct ceph_nfs_fh *)fid->raw);
|
||||
else
|
||||
return __cfh_to_dentry(sb, (struct ceph_nfs_confh *)fid->raw);
|
||||
}
|
||||
|
||||
/*
|
||||
* get parent, if possible.
|
||||
*
|
||||
* FIXME: we could do better by querying the mds to discover the
|
||||
* parent.
|
||||
*/
|
||||
static struct dentry *ceph_fh_to_parent(struct super_block *sb,
|
||||
struct fid *fid,
|
||||
int fh_len, int fh_type)
|
||||
{
|
||||
struct ceph_nfs_confh *cfh = (void *)fid->raw;
|
||||
struct ceph_vino vino;
|
||||
struct inode *inode;
|
||||
struct dentry *dentry;
|
||||
int err;
|
||||
|
||||
if (fh_type == 1)
|
||||
return ERR_PTR(-ESTALE);
|
||||
|
||||
pr_debug("fh_to_parent %llx/%d\n", cfh->parent_ino,
|
||||
cfh->parent_name_hash);
|
||||
|
||||
vino.ino = cfh->parent_ino;	/* look up the parent, not the child */
|
||||
vino.snap = CEPH_NOSNAP;
|
||||
inode = ceph_find_inode(sb, vino);
|
||||
if (!inode)
|
||||
return ERR_PTR(-ESTALE);
|
||||
|
||||
dentry = d_obtain_alias(inode);
|
||||
if (!dentry) {
|
||||
pr_err("fh_to_parent %llx -- inode %p but ENOMEM\n",
|
||||
cfh->ino, inode);
|
||||
iput(inode);
|
||||
return ERR_PTR(-ENOMEM);
|
||||
}
|
||||
err = ceph_init_dentry(dentry);
|
||||
if (err < 0) {
|
||||
iput(inode);
|
||||
return ERR_PTR(err);
|
||||
}
|
||||
dout("fh_to_parent %llx %p dentry %p\n", cfh->ino, inode, dentry);
|
||||
return dentry;
|
||||
}
|
||||
|
||||
const struct export_operations ceph_export_ops = {
|
||||
.encode_fh = ceph_encode_fh,
|
||||
.fh_to_dentry = ceph_fh_to_dentry,
|
||||
.fh_to_parent = ceph_fh_to_parent,
|
||||
};
|
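The export code in fs/ceph/export.c above prefers the larger "connectable" file handle (inode, parent inode, parent name hash) and falls back to the basic inode-only handle when the caller's buffer is too small. A minimal userspace sketch of that selection follows. All names here (`basic_fh`, `conn_fh`, `pick_and_encode_fh`, the `FH_*` constants) are hypothetical, lengths are assumed to be in bytes, and the fallback test uses `>=`, which is a choice of this sketch and not a statement about the kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the two handle layouts. */
struct basic_fh {
	uint64_t ino;
} __attribute__((packed));

struct conn_fh {
	uint64_t ino, parent_ino;
	uint32_t parent_name_hash;
} __attribute__((packed));

enum { FH_NOSPACE = -1, FH_BASIC = 1, FH_CONNECTABLE = 2 };

/*
 * Write the richest handle that fits into buf. On entry *len is the
 * buffer size in bytes; on success it is set to the bytes written.
 * Returns the handle type, or FH_NOSPACE if nothing acceptable fits.
 */
static int pick_and_encode_fh(void *buf, size_t *len, uint64_t ino,
			      uint64_t parent_ino, uint32_t hash,
			      int need_connectable)
{
	if (*len >= sizeof(struct conn_fh)) {
		struct conn_fh cfh = { ino, parent_ino, hash };

		memcpy(buf, &cfh, sizeof(cfh));
		*len = sizeof(cfh);
		return FH_CONNECTABLE;
	}
	if (!need_connectable && *len >= sizeof(struct basic_fh)) {
		struct basic_fh fh = { ino };

		memcpy(buf, &fh, sizeof(fh));
		*len = sizeof(fh);
		return FH_BASIC;
	}
	return FH_NOSPACE;
}
```

The returned type tag is what lets the decode side choose between the two layouts, in the same way `ceph_fh_to_dentry` dispatches on `fh_type`.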
937
fs/ceph/file.c
Normal file
@ -0,0 +1,937 @@
|
||||
#include "ceph_debug.h"
|
||||
|
||||
#include <linux/sched.h>
|
||||
#include <linux/file.h>
|
||||
#include <linux/namei.h>
|
||||
#include <linux/writeback.h>
|
||||
|
||||
#include "super.h"
|
||||
#include "mds_client.h"
|
||||
|
||||
/*
|
||||
* Ceph file operations
|
||||
*
|
||||
* Implement basic open/close functionality, and implement
|
||||
* read/write.
|
||||
*
|
||||
* We implement three modes of file I/O:
|
||||
* - buffered uses the generic_file_aio_{read,write} helpers
|
||||
*
|
||||
* - synchronous is used when there is multi-client read/write
|
||||
* sharing, avoids the page cache, and synchronously waits for an
|
||||
* ack from the OSD.
|
||||
*
|
||||
* - direct io takes the variant of the sync path that references
|
||||
* user pages directly.
|
||||
*
|
||||
* fsync() flushes and waits on dirty pages, but just queues metadata
|
||||
* for writeback: since the MDS can recover size and mtime there is no
|
||||
* need to wait for MDS acknowledgement.
|
||||
*/
|
||||
|
||||
|
||||
/*
|
||||
* Prepare an open request. Preallocate ceph_cap to avoid an
|
||||
* inopportune ENOMEM later.
|
||||
*/
|
||||
static struct ceph_mds_request *
|
||||
prepare_open_request(struct super_block *sb, int flags, int create_mode)
|
||||
{
|
||||
struct ceph_client *client = ceph_sb_to_client(sb);
|
||||
struct ceph_mds_client *mdsc = &client->mdsc;
|
||||
struct ceph_mds_request *req;
|
||||
int want_auth = USE_ANY_MDS;
|
||||
int op = (flags & O_CREAT) ? CEPH_MDS_OP_CREATE : CEPH_MDS_OP_OPEN;
|
||||
|
||||
if (flags & (O_WRONLY|O_RDWR|O_CREAT|O_TRUNC))
|
||||
want_auth = USE_AUTH_MDS;
|
||||
|
||||
req = ceph_mdsc_create_request(mdsc, op, want_auth);
|
||||
if (IS_ERR(req))
|
||||
goto out;
|
||||
req->r_fmode = ceph_flags_to_mode(flags);
|
||||
req->r_args.open.flags = cpu_to_le32(flags);
|
||||
req->r_args.open.mode = cpu_to_le32(create_mode);
|
||||
req->r_args.open.preferred = cpu_to_le32(-1);
|
||||
out:
|
||||
return req;
|
||||
}
|
||||
|
||||
/*
|
||||
* initialize private struct file data.
|
||||
* if we fail, clean up by dropping fmode reference on the ceph_inode
|
||||
*/
|
||||
static int ceph_init_file(struct inode *inode, struct file *file, int fmode)
|
||||
{
|
||||
struct ceph_file_info *cf;
|
||||
int ret = 0;
|
||||
|
||||
switch (inode->i_mode & S_IFMT) {
|
||||
case S_IFREG:
|
||||
case S_IFDIR:
|
||||
dout("init_file %p %p 0%o (regular)\n", inode, file,
|
||||
inode->i_mode);
|
||||
cf = kmem_cache_alloc(ceph_file_cachep, GFP_NOFS | __GFP_ZERO);
|
||||
if (cf == NULL) {
|
||||
ceph_put_fmode(ceph_inode(inode), fmode); /* clean up */
|
||||
return -ENOMEM;
|
||||
}
|
||||
cf->fmode = fmode;
|
||||
cf->next_offset = 2;
|
||||
file->private_data = cf;
|
||||
BUG_ON(inode->i_fop->release != ceph_release);
|
||||
break;
|
||||
|
||||
case S_IFLNK:
|
||||
dout("init_file %p %p 0%o (symlink)\n", inode, file,
|
||||
inode->i_mode);
|
||||
ceph_put_fmode(ceph_inode(inode), fmode); /* clean up */
|
||||
break;
|
||||
|
||||
default:
|
||||
dout("init_file %p %p 0%o (special)\n", inode, file,
|
||||
inode->i_mode);
|
||||
/*
|
||||
* we need to drop the open ref now, since we don't
|
||||
* have .release set to ceph_release.
|
||||
*/
|
||||
ceph_put_fmode(ceph_inode(inode), fmode); /* clean up */
|
||||
BUG_ON(inode->i_fop->release == ceph_release);
|
||||
|
||||
/* call the proper open fop */
|
||||
ret = inode->i_fop->open(inode, file);
|
||||
}
|
||||
return ret;
|
||||
}
|
||||
|
||||
/*
|
||||
* If the filp already has private_data, that means the file was
|
||||
* already opened by intent during lookup, and we do nothing.
|
||||
*
|
||||
* If we already have the requisite capabilities, we can satisfy
|
||||
* the open request locally (no need to request new caps from the
|
||||
* MDS). We do, however, need to inform the MDS (asynchronously)
|
||||
* if our wanted caps set expands.
|
||||
*/
|
||||
int ceph_open(struct inode *inode, struct file *file)
|
||||
{
|
||||
struct ceph_inode_info *ci = ceph_inode(inode);
|
||||
struct ceph_client *client = ceph_sb_to_client(inode->i_sb);
|
||||
struct ceph_mds_client *mdsc = &client->mdsc;
|
||||
struct ceph_mds_request *req;
|
||||
struct ceph_file_info *cf = file->private_data;
|
||||
struct inode *parent_inode = file->f_dentry->d_parent->d_inode;
|
||||
int err;
|
||||
int flags, fmode, wanted;
|
||||
|
||||
if (cf) {
|
||||
dout("open file %p is already opened\n", file);
|
||||
return 0;
|
||||
}
|
||||
|
||||
/* filter out O_CREAT|O_EXCL; vfs did that already. yuck. */
|
||||
flags = file->f_flags & ~(O_CREAT|O_EXCL);
|
||||
if (S_ISDIR(inode->i_mode))
|
||||
flags = O_DIRECTORY; /* mds likes to know */
|
||||
|
||||
dout("open inode %p ino %llx.%llx file %p flags %d (%d)\n", inode,
|
||||
ceph_vinop(inode), file, flags, file->f_flags);
|
||||
fmode = ceph_flags_to_mode(flags);
|
||||
wanted = ceph_caps_for_mode(fmode);
|
||||
|
||||
/* snapped files are read-only */
|
||||
if (ceph_snap(inode) != CEPH_NOSNAP && (file->f_mode & FMODE_WRITE))
|
||||
return -EROFS;
|
||||
|
||||
/* trivially open snapdir */
|
||||
if (ceph_snap(inode) == CEPH_SNAPDIR) {
|
||||
spin_lock(&inode->i_lock);
|
||||
__ceph_get_fmode(ci, fmode);
|
||||
spin_unlock(&inode->i_lock);
|
||||
return ceph_init_file(inode, file, fmode);
|
||||
}
|
||||
|
||||
/*
|
||||
* No need to block if we have any caps. Update wanted set
|
||||
* asynchronously.
|
||||
*/
|
||||
spin_lock(&inode->i_lock);
|
||||
if (__ceph_is_any_real_caps(ci)) {
|
||||
int mds_wanted = __ceph_caps_mds_wanted(ci);
|
||||
int issued = __ceph_caps_issued(ci, NULL);
|
||||
|
||||
dout("open %p fmode %d want %s issued %s using existing\n",
|
||||
inode, fmode, ceph_cap_string(wanted),
|
||||
ceph_cap_string(issued));
|
||||
__ceph_get_fmode(ci, fmode);
|
||||
spin_unlock(&inode->i_lock);
|
||||
|
||||
/* adjust wanted? */
|
||||
if ((issued & wanted) != wanted &&
|
||||
(mds_wanted & wanted) != wanted &&
|
||||
ceph_snap(inode) != CEPH_SNAPDIR)
|
||||
ceph_check_caps(ci, 0, NULL);
|
||||
|
||||
return ceph_init_file(inode, file, fmode);
|
||||
} else if (ceph_snap(inode) != CEPH_NOSNAP &&
|
||||
(ci->i_snap_caps & wanted) == wanted) {
|
||||
__ceph_get_fmode(ci, fmode);
|
||||
spin_unlock(&inode->i_lock);
|
||||
return ceph_init_file(inode, file, fmode);
|
||||
}
|
||||
spin_unlock(&inode->i_lock);
|
||||
|
||||
dout("open fmode %d wants %s\n", fmode, ceph_cap_string(wanted));
|
||||
req = prepare_open_request(inode->i_sb, flags, 0);
|
||||
if (IS_ERR(req)) {
|
||||
err = PTR_ERR(req);
|
||||
goto out;
|
||||
}
|
||||
req->r_inode = igrab(inode);
|
||||
req->r_num_caps = 1;
|
||||
err = ceph_mdsc_do_request(mdsc, parent_inode, req);
|
||||
if (!err)
|
||||
err = ceph_init_file(inode, file, req->r_fmode);
|
||||
ceph_mdsc_put_request(req);
|
||||
dout("open result=%d on %llx.%llx\n", err, ceph_vinop(inode));
|
||||
out:
|
||||
return err;
|
||||
}
|
||||
|
||||
|
||||
/*
|
||||
* Do a lookup + open with a single request.
|
||||
*
|
||||
* If this succeeds, but some subsequent check in the vfs
|
||||
* may_open() fails, the struct *file gets cleaned up (i.e.
|
||||
* ceph_release gets called). So fear not!
|
||||
*/
|
||||
/*
|
||||
* flags
|
||||
* path_lookup_open -> LOOKUP_OPEN
|
||||
* path_lookup_create -> LOOKUP_OPEN|LOOKUP_CREATE
|
||||
*/
|
||||
struct dentry *ceph_lookup_open(struct inode *dir, struct dentry *dentry,
|
||||
struct nameidata *nd, int mode,
|
||||
int locked_dir)
|
||||
{
|
||||
struct ceph_client *client = ceph_sb_to_client(dir->i_sb);
|
||||
struct ceph_mds_client *mdsc = &client->mdsc;
|
||||
struct file *file = nd->intent.open.file;
|
||||
struct inode *parent_inode = get_dentry_parent_inode(file->f_dentry);
|
||||
struct ceph_mds_request *req;
|
||||
int err;
|
||||
int flags = nd->intent.open.flags - 1; /* silly vfs! */
|
||||
|
||||
dout("ceph_lookup_open dentry %p '%.*s' flags %d mode 0%o\n",
|
||||
dentry, dentry->d_name.len, dentry->d_name.name, flags, mode);
|
||||
|
||||
/* do the open */
|
||||
req = prepare_open_request(dir->i_sb, flags, mode);
|
||||
if (IS_ERR(req))
|
||||
return ERR_PTR(PTR_ERR(req));
|
||||
req->r_dentry = dget(dentry);
|
||||
req->r_num_caps = 2;
|
||||
if (flags & O_CREAT) {
|
||||
req->r_dentry_drop = CEPH_CAP_FILE_SHARED;
|
||||
req->r_dentry_unless = CEPH_CAP_FILE_EXCL;
|
||||
}
|
||||
req->r_locked_dir = dir; /* caller holds dir->i_mutex */
|
||||
err = ceph_mdsc_do_request(mdsc, parent_inode, req);
|
||||
dentry = ceph_finish_lookup(req, dentry, err);
|
||||
if (!err && (flags & O_CREAT) && !req->r_reply_info.head->is_dentry)
|
||||
err = ceph_handle_notrace_create(dir, dentry);
|
||||
if (!err)
|
||||
err = ceph_init_file(req->r_dentry->d_inode, file,
|
||||
req->r_fmode);
|
||||
ceph_mdsc_put_request(req);
|
||||
dout("ceph_lookup_open result=%p\n", dentry);
|
||||
return dentry;
|
||||
}
|
||||
|
||||
int ceph_release(struct inode *inode, struct file *file)
|
||||
{
|
||||
struct ceph_inode_info *ci = ceph_inode(inode);
|
||||
struct ceph_file_info *cf = file->private_data;
|
||||
|
||||
dout("release inode %p file %p\n", inode, file);
|
||||
ceph_put_fmode(ci, cf->fmode);
|
||||
if (cf->last_readdir)
|
||||
ceph_mdsc_put_request(cf->last_readdir);
|
||||
kfree(cf->last_name);
|
||||
kfree(cf->dir_info);
|
||||
dput(cf->dentry);
|
||||
kmem_cache_free(ceph_file_cachep, cf);
|
||||
|
||||
/* wake up anyone waiting for caps on this inode */
|
||||
wake_up(&ci->i_cap_wq);
|
||||
return 0;
|
||||
}
|
||||
|
||||
/*
|
||||
* build a vector of user pages
|
||||
*/
|
||||
static struct page **get_direct_page_vector(const char __user *data,
|
||||
int num_pages,
|
||||
loff_t off, size_t len)
|
||||
{
|
||||
struct page **pages;
|
||||
int rc;
|
||||
|
||||
pages = kmalloc(sizeof(*pages) * num_pages, GFP_NOFS);
|
||||
if (!pages)
|
||||
return ERR_PTR(-ENOMEM);
|
||||
|
||||
down_read(&current->mm->mmap_sem);
|
||||
rc = get_user_pages(current, current->mm, (unsigned long)data,
|
||||
num_pages, 0, 0, pages, NULL);
|
||||
up_read(&current->mm->mmap_sem);
|
||||
if (rc < 0)
|
||||
goto fail;
|
||||
return pages;
|
||||
|
||||
fail:
|
||||
kfree(pages);
|
||||
return ERR_PTR(rc);
|
||||
}
|
||||
|
||||
static void put_page_vector(struct page **pages, int num_pages)
|
||||
{
|
||||
int i;
|
||||
|
||||
for (i = 0; i < num_pages; i++)
|
||||
put_page(pages[i]);
|
||||
kfree(pages);
|
||||
}
|
||||
|
||||
void ceph_release_page_vector(struct page **pages, int num_pages)
|
||||
{
|
||||
int i;
|
||||
|
||||
for (i = 0; i < num_pages; i++)
|
||||
__free_pages(pages[i], 0);
|
||||
kfree(pages);
|
||||
}
|
||||
|
||||
/*
|
||||
* allocate a vector of new pages
|
||||
*/
|
||||
static struct page **alloc_page_vector(int num_pages)
|
||||
{
|
||||
struct page **pages;
|
||||
int i;
|
||||
|
||||
pages = kmalloc(sizeof(*pages) * num_pages, GFP_NOFS);
|
||||
if (!pages)
|
||||
return ERR_PTR(-ENOMEM);
|
||||
for (i = 0; i < num_pages; i++) {
|
||||
pages[i] = alloc_page(GFP_NOFS);
|
||||
if (pages[i] == NULL) {
|
||||
ceph_release_page_vector(pages, i);
|
||||
return ERR_PTR(-ENOMEM);
|
||||
}
|
||||
}
|
||||
return pages;
|
||||
}
|
||||
|
||||
/*
|
||||
* copy user data into a page vector
|
||||
*/
|
||||
static int copy_user_to_page_vector(struct page **pages,
|
||||
const char __user *data,
|
||||
loff_t off, size_t len)
|
||||
{
|
||||
int i = 0;
|
||||
int po = off & ~PAGE_CACHE_MASK;
|
||||
int left = len;
|
||||
int l, bad;
|
||||
|
||||
while (left > 0) {
|
||||
l = min_t(int, PAGE_CACHE_SIZE-po, left);
|
||||
bad = copy_from_user(page_address(pages[i]) + po, data, l);
|
||||
if (bad == l)
|
||||
return -EFAULT;
|
||||
data += l - bad;
|
||||
left -= l - bad;
|
||||
po += l - bad;
|
||||
if (po == PAGE_CACHE_SIZE) {
|
||||
po = 0;
|
||||
i++;
|
||||
}
|
||||
}
|
||||
return len;
|
||||
}
|
||||
|
||||
/*
|
||||
* copy data from a page vector into a user pointer
|
||||
*/
|
||||
static int copy_page_vector_to_user(struct page **pages, char __user *data,
|
||||
loff_t off, size_t len)
|
||||
{
|
||||
int i = 0;
|
||||
int po = off & ~PAGE_CACHE_MASK;
|
||||
int left = len;
|
||||
int l, bad;
|
||||
|
||||
while (left > 0) {
|
||||
l = min_t(int, left, PAGE_CACHE_SIZE-po);
|
||||
bad = copy_to_user(data, page_address(pages[i]) + po, l);
|
||||
if (bad == l)
|
||||
return -EFAULT;
|
||||
data += l - bad;
|
||||
left -= l - bad;
|
||||
if (po) {
|
||||
po += l - bad;
|
||||
if (po == PAGE_CACHE_SIZE)
|
||||
po = 0;
|
||||
}
|
||||
i++;
|
||||
}
|
||||
return len;
|
||||
}
|
||||
|
||||
/*
|
||||
* Zero an extent within a page vector. Offset is relative to the
|
||||
* start of the first page.
|
||||
*/
|
||||
static void zero_page_vector_range(int off, int len, struct page **pages)
|
||||
{
|
||||
int i = off >> PAGE_CACHE_SHIFT;
|
||||
|
||||
off &= ~PAGE_CACHE_MASK;
|
||||
|
||||
dout("zero_page_vector_page %u~%u\n", off, len);
|
||||
|
||||
/* leading partial page? */
|
||||
if (off) {
|
||||
int end = min((int)PAGE_CACHE_SIZE, off + len);
|
||||
dout("zeroing %d %p head from %d\n", i, pages[i],
|
||||
(int)off);
|
||||
zero_user_segment(pages[i], off, end);
|
||||
len -= (end - off);
|
||||
i++;
|
||||
}
|
||||
while (len >= PAGE_CACHE_SIZE) {
|
||||
dout("zeroing %d %p len=%d\n", i, pages[i], len);
|
||||
zero_user_segment(pages[i], 0, PAGE_CACHE_SIZE);
|
||||
len -= PAGE_CACHE_SIZE;
|
||||
i++;
|
||||
}
|
||||
/* trailing partial page? */
|
||||
if (len) {
|
||||
dout("zeroing %d %p tail to %d\n", i, pages[i], (int)len);
|
||||
zero_user_segment(pages[i], 0, len);
|
||||
}
|
||||
}
|
||||
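zero_page_vector_range() above zeroes a byte extent that may start mid-page, cover whole pages, and end mid-page. A minimal userspace sketch of the same walk follows, folding the three cases into one loop. `SKETCH_PAGE_SIZE` and `zero_range` are hypothetical names, the "pages" are plain byte buffers, and the tiny page size is only there to keep the example easy to follow.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 16	/* hypothetical tiny page, for illustration */

/*
 * Zero len bytes starting at byte offset off, where off is relative to
 * the start of pages[0]. Each iteration clears the part of the extent
 * that falls inside the current page, then moves to the next page.
 */
static void zero_range(unsigned char **pages, size_t off, size_t len)
{
	size_t i = off / SKETCH_PAGE_SIZE;	/* index of first page touched */
	size_t po = off % SKETCH_PAGE_SIZE;	/* offset within that page */

	while (len > 0) {
		size_t n = SKETCH_PAGE_SIZE - po;	/* room left in page */

		if (n > len)
			n = len;
		memset(pages[i] + po, 0, n);
		len -= n;
		po = 0;		/* later pages start at offset 0 */
		i++;
	}
}
```

For example, with 16-byte pages, zeroing 24 bytes at offset 12 clears the last 4 bytes of the first page, all of the second, and the first 4 bytes of the third.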
|
||||
|
||||
/*
|
||||
* Read a range of bytes striped over one or more objects. Iterate over
|
||||
* objects we stripe over. (That's not atomic, but good enough for now.)
|
||||
*
|
||||
* If we get a short result from the OSD, check against i_size; we need to
|
||||
* only return a short read to the caller if we hit EOF.
|
||||
*/
|
||||
static int striped_read(struct inode *inode,
|
||||
u64 off, u64 len,
|
||||
struct page **pages, int num_pages,
|
||||
int *checkeof)
|
||||
{
|
||||
struct ceph_client *client = ceph_inode_to_client(inode);
|
||||
struct ceph_inode_info *ci = ceph_inode(inode);
|
||||
u64 pos, this_len;
|
||||
int page_off = off & ~PAGE_CACHE_MASK; /* first byte's offset in page */
|
||||
int left, pages_left;
|
||||
int read;
|
||||
struct page **page_pos;
|
||||
int ret;
|
||||
bool hit_stripe, was_short;
|
||||
|
||||
/*
|
||||
* we may need to do multiple reads. not atomic, unfortunately.
|
||||
*/
|
||||
pos = off;
|
||||
left = len;
|
||||
page_pos = pages;
|
||||
pages_left = num_pages;
|
||||
read = 0;
|
||||
|
||||
more:
|
||||
this_len = left;
|
||||
ret = ceph_osdc_readpages(&client->osdc, ceph_vino(inode),
|
||||
&ci->i_layout, pos, &this_len,
|
||||
ci->i_truncate_seq,
|
||||
ci->i_truncate_size,
|
||||
page_pos, pages_left);
|
||||
hit_stripe = this_len < left;
|
||||
was_short = ret >= 0 && ret < this_len;
|
||||
if (ret == -ENOENT)
|
||||
ret = 0;
|
||||
dout("striped_read %llu~%u (read %u) got %d%s%s\n", pos, left, read,
|
||||
ret, hit_stripe ? " HITSTRIPE" : "", was_short ? " SHORT" : "");
|
||||
|
||||
if (ret > 0) {
|
||||
int didpages =
|
||||
((pos & ~PAGE_CACHE_MASK) + ret) >> PAGE_CACHE_SHIFT;
|
||||
|
||||
if (read < pos - off) {
|
||||
dout(" zero gap %llu to %llu\n", off + read, pos);
|
||||
zero_page_vector_range(page_off + read,
|
||||
pos - off - read, pages);
|
||||
}
|
||||
pos += ret;
|
||||
read = pos - off;
|
||||
left -= ret;
|
||||
page_pos += didpages;
|
||||
pages_left -= didpages;
|
||||
|
||||
/* hit stripe? */
|
||||
if (left && hit_stripe)
|
||||
goto more;
|
||||
}
|
||||
|
||||
if (was_short) {
|
||||
/* was original extent fully inside i_size? */
|
||||
if (pos + left <= inode->i_size) {
|
||||
dout("zero tail\n");
|
||||
zero_page_vector_range(page_off + read, len - read,
|
||||
pages);
|
||||
read = len;
|
||||
goto out;
|
||||
}
|
||||
|
||||
/* check i_size */
|
||||
*checkeof = 1;
|
||||
}
|
||||
|
||||
out:
|
||||
if (ret >= 0)
|
||||
ret = read;
|
||||
dout("striped_read returns %d\n", ret);
|
||||
return ret;
|
||||
}
|
||||
|
||||
/*
|
||||
* Completely synchronous read and write methods. Direct from __user
|
||||
* buffer to osd, or directly to user pages (if O_DIRECT).
|
||||
*
|
||||
* If the read spans object boundary, just do multiple reads.
|
||||
*/
|
||||
static ssize_t ceph_sync_read(struct file *file, char __user *data,
|
||||
unsigned len, loff_t *poff, int *checkeof)
|
||||
{
|
||||
struct inode *inode = file->f_dentry->d_inode;
|
||||
struct page **pages;
|
||||
u64 off = *poff;
|
||||
int num_pages = calc_pages_for(off, len);
|
||||
int ret;
|
||||
|
||||
dout("sync_read on file %p %llu~%u %s\n", file, off, len,
|
||||
(file->f_flags & O_DIRECT) ? "O_DIRECT" : "");
|
||||
|
||||
if (file->f_flags & O_DIRECT) {
|
||||
pages = get_direct_page_vector(data, num_pages, off, len);
|
||||
|
||||
/*
|
||||
* flush any page cache pages in this range. this
|
||||
* will make concurrent normal and O_DIRECT io slow,
|
||||
* but it will at least behave sensibly when they are
|
||||
* in sequence.
|
||||
*/
|
||||
} else {
|
||||
pages = alloc_page_vector(num_pages);
|
||||
}
|
||||
if (IS_ERR(pages))
|
||||
return PTR_ERR(pages);
|
||||
|
||||
ret = filemap_write_and_wait(inode->i_mapping);
|
||||
if (ret < 0)
|
||||
goto done;
|
||||
|
||||
ret = striped_read(inode, off, len, pages, num_pages, checkeof);
|
||||
|
||||
if (ret >= 0 && (file->f_flags & O_DIRECT) == 0)
|
||||
ret = copy_page_vector_to_user(pages, data, off, ret);
|
||||
if (ret >= 0)
|
||||
*poff = off + ret;
|
||||
|
||||
done:
|
||||
if (file->f_flags & O_DIRECT)
|
||||
put_page_vector(pages, num_pages);
|
||||
else
|
||||
ceph_release_page_vector(pages, num_pages);
|
||||
dout("sync_read result %d\n", ret);
|
||||
return ret;
|
||||
}
|
||||
|
||||
/*
|
||||
* Write commit callback, called if we requested both an ACK and
|
||||
* ONDISK commit reply from the OSD.
|
||||
*/
|
||||
static void sync_write_commit(struct ceph_osd_request *req,
|
||||
struct ceph_msg *msg)
|
||||
{
|
||||
struct ceph_inode_info *ci = ceph_inode(req->r_inode);
|
||||
|
||||
dout("sync_write_commit %p tid %llu\n", req, req->r_tid);
|
||||
spin_lock(&ci->i_unsafe_lock);
|
||||
list_del_init(&req->r_unsafe_item);
|
||||
spin_unlock(&ci->i_unsafe_lock);
|
||||
ceph_put_cap_refs(ci, CEPH_CAP_FILE_WR);
|
||||
}
|
||||
|
||||
/*
|
||||
* Synchronous write, straight from __user pointer or user pages (if
|
||||
* O_DIRECT).
|
||||
*
|
||||
* If write spans object boundary, just do multiple writes. (For a
|
||||
* correct atomic write, we should e.g. take write locks on all
|
||||
* objects, rollback on failure, etc.)
|
||||
*/
|
||||
static ssize_t ceph_sync_write(struct file *file, const char __user *data,
|
||||
size_t left, loff_t *offset)
|
||||
{
|
||||
struct inode *inode = file->f_dentry->d_inode;
|
||||
struct ceph_inode_info *ci = ceph_inode(inode);
|
||||
struct ceph_client *client = ceph_inode_to_client(inode);
|
||||
struct ceph_osd_request *req;
|
||||
struct page **pages;
|
||||
int num_pages;
|
||||
long long unsigned pos;
|
||||
u64 len;
|
||||
int written = 0;
|
||||
int flags;
|
||||
int do_sync = 0;
|
||||
int check_caps = 0;
|
||||
int ret;
|
||||
struct timespec mtime = CURRENT_TIME;
|
||||
|
||||
if (ceph_snap(file->f_dentry->d_inode) != CEPH_NOSNAP)
|
||||
return -EROFS;
|
||||
|
||||
dout("sync_write on file %p %lld~%u %s\n", file, *offset,
|
||||
(unsigned)left, (file->f_flags & O_DIRECT) ? "O_DIRECT" : "");
|
||||
|
||||
if (file->f_flags & O_APPEND)
|
||||
pos = i_size_read(inode);
|
||||
else
|
||||
pos = *offset;
|
||||
|
||||
ret = filemap_write_and_wait_range(inode->i_mapping, pos, pos + left);
|
||||
if (ret < 0)
|
||||
return ret;
|
||||
|
||||
ret = invalidate_inode_pages2_range(inode->i_mapping,
|
||||
pos >> PAGE_CACHE_SHIFT,
|
||||
(pos + left) >> PAGE_CACHE_SHIFT);
|
||||
if (ret < 0)
|
||||
dout("invalidate_inode_pages2_range returned %d\n", ret);
|
||||
|
||||
flags = CEPH_OSD_FLAG_ORDERSNAP |
|
||||
CEPH_OSD_FLAG_ONDISK |
|
||||
CEPH_OSD_FLAG_WRITE;
|
||||
if ((file->f_flags & (O_SYNC|O_DIRECT)) == 0)
|
||||
flags |= CEPH_OSD_FLAG_ACK;
|
||||
else
|
||||
do_sync = 1;
|
||||
|
||||
/*
|
||||
* we may need to do multiple writes here if we span an object
|
||||
* boundary. this isn't atomic, unfortunately. :(
|
||||
*/
|
||||
more:
|
||||
len = left;
|
||||
req = ceph_osdc_new_request(&client->osdc, &ci->i_layout,
|
||||
ceph_vino(inode), pos, &len,
|
||||
CEPH_OSD_OP_WRITE, flags,
|
||||
ci->i_snap_realm->cached_context,
|
||||
do_sync,
|
||||
ci->i_truncate_seq, ci->i_truncate_size,
|
||||
&mtime, false, 2);
|
||||
if (IS_ERR(req))
|
||||
return PTR_ERR(req);
|
||||
|
||||
num_pages = calc_pages_for(pos, len);
|
||||
|
||||
if (file->f_flags & O_DIRECT) {
|
||||
pages = get_direct_page_vector(data, num_pages, pos, len);
|
||||
if (IS_ERR(pages)) {
|
||||
ret = PTR_ERR(pages);
|
||||
goto out;
|
||||
}
|
||||
|
||||
/*
|
||||
* throw out any page cache pages in this range. this
|
||||
* may block.
|
||||
*/
|
||||
truncate_inode_pages_range(inode->i_mapping, pos, pos+len);
|
||||
} else {
|
||||
pages = alloc_page_vector(num_pages);
|
||||
if (IS_ERR(pages)) {
|
||||
ret = PTR_ERR(pages);
|
||||
goto out;
|
||||
}
|
||||
ret = copy_user_to_page_vector(pages, data, pos, len);
|
||||
if (ret < 0) {
|
||||
ceph_release_page_vector(pages, num_pages);
|
||||
goto out;
|
||||
}
|
||||
|
||||
if ((file->f_flags & O_SYNC) == 0) {
|
||||
/* get a second commit callback */
|
||||
req->r_safe_callback = sync_write_commit;
|
||||
req->r_own_pages = 1;
|
||||
}
|
||||
}
|
||||
req->r_pages = pages;
|
||||
req->r_num_pages = num_pages;
|
||||
req->r_inode = inode;
|
||||
|
||||
ret = ceph_osdc_start_request(&client->osdc, req, false);
|
||||
if (!ret) {
|
||||
if (req->r_safe_callback) {
|
||||
/*
|
||||
* Add to inode unsafe list only after we
|
||||
* start_request so that a tid has been assigned.
|
||||
*/
|
||||
spin_lock(&ci->i_unsafe_lock);
|
||||
list_add_tail(&req->r_unsafe_item, &ci->i_unsafe_writes);
|
||||
spin_unlock(&ci->i_unsafe_lock);
|
||||
ceph_get_cap_refs(ci, CEPH_CAP_FILE_WR);
|
||||
}
|
||||
ret = ceph_osdc_wait_request(&client->osdc, req);
|
||||
}
|
||||
|
||||
if (file->f_flags & O_DIRECT)
|
||||
put_page_vector(pages, num_pages);
|
||||
else if (file->f_flags & O_SYNC)
|
||||
ceph_release_page_vector(pages, num_pages);
|
||||
|
||||
out:
|
||||
ceph_osdc_put_request(req);
|
||||
if (ret == 0) {
|
||||
pos += len;
|
||||
written += len;
|
||||
left -= len;
|
||||
if (left)
|
||||
goto more;
|
||||
|
||||
ret = written;
|
||||
*offset = pos;
|
||||
if (pos > i_size_read(inode))
|
||||
check_caps = ceph_inode_set_size(inode, pos);
|
||||
if (check_caps)
|
||||
ceph_check_caps(ceph_inode(inode), CHECK_CAPS_AUTHONLY,
|
||||
NULL);
|
||||
}
|
||||
return ret;
|
||||
}
|
||||
|
||||
/*
|
||||
* Wrap generic_file_aio_read with checks for cap bits on the inode.
|
||||
* Atomically grab references, so that those bits are not released
|
||||
* back to the MDS mid-read.
|
||||
*
|
||||
* Hmm, the sync read case isn't actually async... should it be?
|
||||
*/
|
||||
static ssize_t ceph_aio_read(struct kiocb *iocb, const struct iovec *iov,
|
||||
unsigned long nr_segs, loff_t pos)
|
||||
{
|
||||
struct file *filp = iocb->ki_filp;
|
||||
loff_t *ppos = &iocb->ki_pos;
|
||||
size_t len = iov->iov_len;
|
||||
struct inode *inode = filp->f_dentry->d_inode;
|
||||
struct ceph_inode_info *ci = ceph_inode(inode);
|
||||
void *base = iov->iov_base;
|
||||
ssize_t ret;
|
||||
int got = 0;
|
||||
int checkeof = 0, read = 0;
|
||||
|
||||
dout("aio_read %p %llx.%llx %llu~%u trying to get caps on %p\n",
|
||||
inode, ceph_vinop(inode), pos, (unsigned)len, inode);
|
||||
again:
|
||||
__ceph_do_pending_vmtruncate(inode);
|
||||
ret = ceph_get_caps(ci, CEPH_CAP_FILE_RD, CEPH_CAP_FILE_CACHE,
|
||||
&got, -1);
|
||||
if (ret < 0)
|
||||
goto out;
|
||||
dout("aio_read %p %llx.%llx %llu~%u got cap refs on %s\n",
|
||||
inode, ceph_vinop(inode), pos, (unsigned)len,
|
||||
ceph_cap_string(got));
|
||||
|
||||
if ((got & CEPH_CAP_FILE_CACHE) == 0 ||
|
||||
(iocb->ki_filp->f_flags & O_DIRECT) ||
|
||||
(inode->i_sb->s_flags & MS_SYNCHRONOUS))
|
||||
/* hmm, this isn't really async... */
|
||||
ret = ceph_sync_read(filp, base, len, ppos, &checkeof);
|
||||
else
|
||||
ret = generic_file_aio_read(iocb, iov, nr_segs, pos);
|
||||
|
||||
out:
|
||||
dout("aio_read %p %llx.%llx dropping cap refs on %s = %d\n",
|
||||
inode, ceph_vinop(inode), ceph_cap_string(got), (int)ret);
|
||||
ceph_put_cap_refs(ci, got);
|
||||
|
||||
if (checkeof && ret >= 0) {
|
||||
int statret = ceph_do_getattr(inode, CEPH_STAT_CAP_SIZE);
|
||||
|
||||
/* hit EOF or hole? */
|
||||
if (statret == 0 && *ppos < inode->i_size) {
|
||||
dout("aio_read sync_read hit hole, reading more\n");
|
||||
read += ret;
|
||||
base += ret;
|
||||
len -= ret;
|
||||
checkeof = 0;
|
||||
goto again;
|
||||
}
|
||||
}
|
||||
if (ret >= 0)
|
||||
ret += read;
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
/*
|
||||
* Take cap references to avoid releasing caps to MDS mid-write.
|
||||
*
|
||||
* If we are synchronous, and write with an old snap context, the OSD
|
||||
* may return EOLDSNAPC. In that case, retry the write.. _after_
|
||||
* dropping our cap refs and allowing the pending snap to logically
|
||||
* complete _before_ this write occurs.
|
||||
*
|
||||
* If we are near ENOSPC, write synchronously.
|
||||
*/
|
||||
static ssize_t ceph_aio_write(struct kiocb *iocb, const struct iovec *iov,
|
||||
unsigned long nr_segs, loff_t pos)
|
||||
{
|
||||
struct file *file = iocb->ki_filp;
|
||||
struct inode *inode = file->f_dentry->d_inode;
|
||||
struct ceph_inode_info *ci = ceph_inode(inode);
|
||||
struct ceph_osd_client *osdc = &ceph_client(inode->i_sb)->osdc;
|
||||
loff_t endoff = pos + iov->iov_len;
|
||||
int got = 0;
|
||||
int ret, err;
|
||||
|
||||
if (ceph_snap(inode) != CEPH_NOSNAP)
|
||||
return -EROFS;
|
||||
|
||||
retry_snap:
|
||||
if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL))
|
||||
return -ENOSPC;
|
||||
__ceph_do_pending_vmtruncate(inode);
|
||||
dout("aio_write %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
|
||||
inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
|
||||
inode->i_size);
|
||||
ret = ceph_get_caps(ci, CEPH_CAP_FILE_WR, CEPH_CAP_FILE_BUFFER,
|
||||
&got, endoff);
|
||||
if (ret < 0)
|
||||
goto out;
|
||||
|
||||
dout("aio_write %p %llx.%llx %llu~%u got cap refs on %s\n",
|
||||
inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
|
||||
ceph_cap_string(got));
|
||||
|
||||
if ((got & CEPH_CAP_FILE_BUFFER) == 0 ||
|
||||
(iocb->ki_filp->f_flags & O_DIRECT) ||
|
||||
(inode->i_sb->s_flags & MS_SYNCHRONOUS)) {
|
||||
ret = ceph_sync_write(file, iov->iov_base, iov->iov_len,
|
||||
&iocb->ki_pos);
|
||||
} else {
|
||||
ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
|
||||
|
||||
if ((ret >= 0 || ret == -EIOCBQUEUED) &&
|
||||
((file->f_flags & O_SYNC) || IS_SYNC(file->f_mapping->host)
|
||||
|| ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_NEARFULL))) {
|
||||
err = vfs_fsync_range(file, file->f_path.dentry,
|
||||
pos, pos + ret - 1, 1);
|
||||
if (err < 0)
|
||||
ret = err;
|
||||
}
|
||||
}
|
||||
if (ret >= 0) {
|
||||
spin_lock(&inode->i_lock);
|
||||
__ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR);
|
||||
spin_unlock(&inode->i_lock);
|
||||
}
|
||||
|
||||
out:
|
||||
dout("aio_write %p %llx.%llx %llu~%u dropping cap refs on %s\n",
|
||||
inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
|
||||
ceph_cap_string(got));
|
||||
ceph_put_cap_refs(ci, got);
|
||||
|
||||
if (ret == -EOLDSNAPC) {
|
||||
dout("aio_write %p %llx.%llx %llu~%u got EOLDSNAPC, retrying\n",
|
||||
inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len);
|
||||
goto retry_snap;
|
||||
}
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
/*
|
||||
* llseek. be sure to verify file size on SEEK_END.
|
||||
*/
|
||||
static loff_t ceph_llseek(struct file *file, loff_t offset, int origin)
|
||||
{
|
||||
struct inode *inode = file->f_mapping->host;
|
||||
int ret;
|
||||
|
||||
mutex_lock(&inode->i_mutex);
|
||||
__ceph_do_pending_vmtruncate(inode);
|
||||
switch (origin) {
|
||||
case SEEK_END:
|
||||
ret = ceph_do_getattr(inode, CEPH_STAT_CAP_SIZE);
|
||||
if (ret < 0) {
|
||||
offset = ret;
|
||||
goto out;
|
||||
}
|
||||
offset += inode->i_size;
|
||||
break;
|
||||
case SEEK_CUR:
|
||||
/*
|
||||
* Here we special-case the lseek(fd, 0, SEEK_CUR)
|
||||
* position-querying operation. Avoid rewriting the "same"
|
||||
* f_pos value back to the file because a concurrent read(),
|
||||
* write() or lseek() might have altered it
|
||||
*/
|
||||
if (offset == 0) {
|
||||
offset = file->f_pos;
|
||||
goto out;
|
||||
}
|
||||
offset += file->f_pos;
|
||||
break;
|
||||
}
|
||||
|
||||
if (offset < 0 || offset > inode->i_sb->s_maxbytes) {
|
||||
offset = -EINVAL;
|
||||
goto out;
|
||||
}
|
||||
|
||||
/* Special lock needed here? */
|
||||
if (offset != file->f_pos) {
|
||||
file->f_pos = offset;
|
||||
file->f_version = 0;
|
||||
}
|
||||
|
||||
out:
|
||||
mutex_unlock(&inode->i_mutex);
|
||||
return offset;
|
||||
}
|
||||
|
||||
const struct file_operations ceph_file_fops = {
|
||||
.open = ceph_open,
|
||||
.release = ceph_release,
|
||||
.llseek = ceph_llseek,
|
||||
.read = do_sync_read,
|
||||
.write = do_sync_write,
|
||||
.aio_read = ceph_aio_read,
|
||||
.aio_write = ceph_aio_write,
|
||||
.mmap = ceph_mmap,
|
||||
.fsync = ceph_fsync,
|
||||
.splice_read = generic_file_splice_read,
|
||||
.splice_write = generic_file_splice_write,
|
||||
.unlocked_ioctl = ceph_ioctl,
|
||||
.compat_ioctl = ceph_ioctl,
|
||||
};
|
||||
|
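The comment in ceph_sync_write notes that a write spanning an object boundary is issued as several requests. A minimal userspace sketch of that splitting follows, assuming unstriped objects of a fixed size; `next_chunk_len` and `count_requests` are hypothetical helpers written for illustration and are not part of the patch (in the kernel code the clipping appears to happen inside ceph_osdc_new_request via the `&len` argument).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: length of the next chunk of a write that starts
 * at `pos` with `left` bytes remaining, clipped so that it does not
 * cross an object boundary.  Assumes unstriped objects of `object_size`
 * bytes each (an assumption made only for this sketch).
 */
static uint64_t next_chunk_len(uint64_t pos, uint64_t left,
			       uint64_t object_size)
{
	uint64_t room;

	assert(object_size > 0);
	room = object_size - (pos % object_size);	/* bytes left in object */
	return left < room ? left : room;
}

/*
 * Mirror of the "more:" loop: advance pos and shrink left one chunk at
 * a time, counting how many separate requests would be issued.
 */
static unsigned count_requests(uint64_t pos, uint64_t left,
			       uint64_t object_size)
{
	unsigned n = 0;

	while (left) {
		uint64_t len = next_chunk_len(pos, left, object_size);

		pos += len;
		left -= len;
		n++;
	}
	return n;
}
```

Because each chunk is a separate request, a failure partway through can leave earlier chunks written, which is the non-atomicity the original comment laments.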
1750
fs/ceph/inode.c
Normal file
File diff suppressed because it is too large
160
fs/ceph/ioctl.c
Normal file
@ -0,0 +1,160 @@
|
||||
#include <linux/in.h>
|
||||
|
||||
#include "ioctl.h"
|
||||
#include "super.h"
|
||||
#include "ceph_debug.h"
|
||||
|
||||
|
||||
/*
|
||||
* ioctls
|
||||
*/
|
||||
|
||||
/*
|
||||
* get and set the file layout
|
||||
*/
|
||||
static long ceph_ioctl_get_layout(struct file *file, void __user *arg)
|
||||
{
|
||||
struct ceph_inode_info *ci = ceph_inode(file->f_dentry->d_inode);
|
||||
struct ceph_ioctl_layout l;
|
||||
int err;
|
||||
|
||||
err = ceph_do_getattr(file->f_dentry->d_inode, CEPH_STAT_CAP_LAYOUT);
|
||||
if (!err) {
|
||||
l.stripe_unit = ceph_file_layout_su(ci->i_layout);
|
||||
l.stripe_count = ceph_file_layout_stripe_count(ci->i_layout);
|
||||
l.object_size = ceph_file_layout_object_size(ci->i_layout);
|
||||
l.data_pool = le32_to_cpu(ci->i_layout.fl_pg_pool);
|
||||
l.preferred_osd =
|
||||
(s32)le32_to_cpu(ci->i_layout.fl_pg_preferred);
|
||||
if (copy_to_user(arg, &l, sizeof(l)))
|
||||
return -EFAULT;
|
||||
}
|
||||
|
||||
return err;
|
||||
}
|
||||
|
||||
static long ceph_ioctl_set_layout(struct file *file, void __user *arg)
|
||||
{
|
||||
struct inode *inode = file->f_dentry->d_inode;
|
||||
struct inode *parent_inode = file->f_dentry->d_parent->d_inode;
|
||||
struct ceph_mds_client *mdsc = &ceph_sb_to_client(inode->i_sb)->mdsc;
|
||||
struct ceph_mds_request *req;
|
||||
struct ceph_ioctl_layout l;
|
||||
int err, i;
|
||||
|
||||
/* copy and validate */
|
||||
if (copy_from_user(&l, arg, sizeof(l)))
|
||||
return -EFAULT;
|
||||
|
||||
if ((l.object_size & ~PAGE_MASK) ||
|
||||
(l.stripe_unit & ~PAGE_MASK) ||
|
||||
!l.stripe_unit ||
|
||||
(l.object_size &&
|
||||
(unsigned)l.object_size % (unsigned)l.stripe_unit))
|
||||
return -EINVAL;
|
||||
|
||||
/* make sure it's a valid data pool */
|
||||
if (l.data_pool > 0) {
|
||||
mutex_lock(&mdsc->mutex);
|
||||
err = -EINVAL;
|
||||
for (i = 0; i < mdsc->mdsmap->m_num_data_pg_pools; i++)
|
||||
if (mdsc->mdsmap->m_data_pg_pools[i] == l.data_pool) {
|
||||
err = 0;
|
||||
break;
|
||||
}
|
||||
mutex_unlock(&mdsc->mutex);
|
||||
if (err)
|
||||
return err;
|
||||
}
|
||||
|
||||
req = ceph_mdsc_create_request(mdsc, CEPH_MDS_OP_SETLAYOUT,
|
||||
USE_AUTH_MDS);
|
||||
if (IS_ERR(req))
|
||||
return PTR_ERR(req);
|
||||
req->r_inode = igrab(inode);
|
||||
req->r_inode_drop = CEPH_CAP_FILE_SHARED | CEPH_CAP_FILE_EXCL;
|
||||
|
||||
req->r_args.setlayout.layout.fl_stripe_unit =
|
||||
cpu_to_le32(l.stripe_unit);
|
||||
req->r_args.setlayout.layout.fl_stripe_count =
|
||||
cpu_to_le32(l.stripe_count);
|
||||
req->r_args.setlayout.layout.fl_object_size =
|
||||
cpu_to_le32(l.object_size);
|
||||
req->r_args.setlayout.layout.fl_pg_pool = cpu_to_le32(l.data_pool);
|
||||
req->r_args.setlayout.layout.fl_pg_preferred =
|
||||
cpu_to_le32(l.preferred_osd);
|
||||
|
||||
err = ceph_mdsc_do_request(mdsc, parent_inode, req);
|
||||
ceph_mdsc_put_request(req);
|
||||
return err;
|
||||
}
|
||||
|
||||
/*
|
||||
* Return object name, size/offset information, and location (OSD
|
||||
* number, network address) for a given file offset.
|
||||
*/
|
||||
static long ceph_ioctl_get_dataloc(struct file *file, void __user *arg)
|
||||
{
|
||||
struct ceph_ioctl_dataloc dl;
|
||||
struct inode *inode = file->f_dentry->d_inode;
|
||||
struct ceph_inode_info *ci = ceph_inode(inode);
|
||||
struct ceph_osd_client *osdc = &ceph_client(inode->i_sb)->osdc;
|
||||
u64 len = 1, olen;
|
||||
u64 tmp;
|
||||
struct ceph_object_layout ol;
|
||||
struct ceph_pg pgid;
|
||||
|
||||
/* copy and validate */
|
||||
if (copy_from_user(&dl, arg, sizeof(dl)))
|
||||
return -EFAULT;
|
||||
|
||||
down_read(&osdc->map_sem);
|
||||
ceph_calc_file_object_mapping(&ci->i_layout, dl.file_offset, &len,
|
||||
&dl.object_no, &dl.object_offset, &olen);
|
||||
dl.file_offset -= dl.object_offset;
|
||||
dl.object_size = ceph_file_layout_object_size(ci->i_layout);
|
||||
dl.block_size = ceph_file_layout_su(ci->i_layout);
|
||||
|
||||
/* block_offset = object_offset % block_size */
|
||||
tmp = dl.object_offset;
|
||||
dl.block_offset = do_div(tmp, dl.block_size);
|
||||
|
||||
snprintf(dl.object_name, sizeof(dl.object_name), "%llx.%08llx",
|
||||
ceph_ino(inode), dl.object_no);
|
||||
ceph_calc_object_layout(&ol, dl.object_name, &ci->i_layout,
|
||||
osdc->osdmap);
|
||||
|
||||
pgid = ol.ol_pgid;
|
||||
dl.osd = ceph_calc_pg_primary(osdc->osdmap, pgid);
|
||||
if (dl.osd >= 0) {
|
||||
struct ceph_entity_addr *a =
|
||||
ceph_osd_addr(osdc->osdmap, dl.osd);
|
||||
if (a)
|
||||
memcpy(&dl.osd_addr, &a->in_addr, sizeof(dl.osd_addr));
|
||||
} else {
|
||||
memset(&dl.osd_addr, 0, sizeof(dl.osd_addr));
|
||||
}
|
||||
up_read(&osdc->map_sem);
|
||||
|
||||
/* send result back to user */
|
||||
if (copy_to_user(arg, &dl, sizeof(dl)))
|
||||
return -EFAULT;
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
long ceph_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
|
||||
{
|
||||
dout("ioctl file %p cmd %u arg %lu\n", file, cmd, arg);
|
||||
switch (cmd) {
|
||||
case CEPH_IOC_GET_LAYOUT:
|
||||
return ceph_ioctl_get_layout(file, (void __user *)arg);
|
||||
|
||||
case CEPH_IOC_SET_LAYOUT:
|
||||
return ceph_ioctl_set_layout(file, (void __user *)arg);
|
||||
|
||||
case CEPH_IOC_GET_DATALOC:
|
||||
return ceph_ioctl_get_dataloc(file, (void __user *)arg);
|
||||
}
|
||||
return -ENOTTY;
|
||||
}
|
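The validation performed by ceph_ioctl_set_layout before it contacts the MDS can be sketched as a standalone predicate. This is an illustration only: `layout_is_valid` and `SKETCH_PAGE_SIZE` are hypothetical names, a 4 KiB page size is assumed, and the sketch does the modulus in 64 bits whereas the kernel code casts both operands to `unsigned` first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL	/* assumption: 4 KiB pages */

/*
 * Hypothetical userspace restatement of the checks: both sizes must be
 * page aligned, the stripe unit must be nonzero, and a nonzero object
 * size must be a whole multiple of the stripe unit.
 */
static bool layout_is_valid(uint64_t stripe_unit, uint64_t object_size)
{
	if (object_size & (SKETCH_PAGE_SIZE - 1))
		return false;		/* object size not page aligned */
	if (stripe_unit & (SKETCH_PAGE_SIZE - 1))
		return false;		/* stripe unit not page aligned */
	if (!stripe_unit)
		return false;		/* would divide by zero below */
	if (object_size && (object_size % stripe_unit))
		return false;		/* object must hold whole stripe units */
	return true;
}
```

Checking for a zero stripe unit before the modulus matters: the order of the tests is what keeps the division safe.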
40
fs/ceph/ioctl.h
Normal file
@ -0,0 +1,40 @@
#ifndef FS_CEPH_IOCTL_H
#define FS_CEPH_IOCTL_H

#include <linux/ioctl.h>
#include <linux/types.h>

#define CEPH_IOCTL_MAGIC 0x97

/* just use u64 to align sanely on all archs */
struct ceph_ioctl_layout {
	__u64 stripe_unit, stripe_count, object_size;
	__u64 data_pool;
	__s64 preferred_osd;
};

#define CEPH_IOC_GET_LAYOUT _IOR(CEPH_IOCTL_MAGIC, 1,		\
				   struct ceph_ioctl_layout)
#define CEPH_IOC_SET_LAYOUT _IOW(CEPH_IOCTL_MAGIC, 2,		\
				   struct ceph_ioctl_layout)

/*
 * Extract identity, address of the OSD and object storing a given
 * file offset.
 */
struct ceph_ioctl_dataloc {
	__u64 file_offset;           /* in+out: file offset */
	__u64 object_offset;         /* out: offset in object */
	__u64 object_no;             /* out: object # */
	__u64 object_size;           /* out: object size */
	char object_name[64];        /* out: object name */
	__u64 block_offset;          /* out: offset in block */
	__u64 block_size;            /* out: block length */
	__s64 osd;                   /* out: osd # */
	struct sockaddr_storage osd_addr; /* out: osd address */
};

#define CEPH_IOC_GET_DATALOC _IOWR(CEPH_IOCTL_MAGIC, 3,	\
				   struct ceph_ioctl_dataloc)

#endif
|
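To make the `ceph_ioctl_dataloc` output fields concrete, here is a hedged userspace sketch of how they relate to one another for the simplest layout. It assumes a stripe count of 1, so that an object is just a contiguous `object_size` slice of the file; the general striped mapping done by ceph_calc_file_object_mapping is not reproduced. `struct dataloc_sketch` and `locate` are hypothetical names; only the `"%llx.%08llx"` name format and the `block_offset = object_offset % block_size` rule are taken from ceph_ioctl_get_dataloc.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-in for struct ceph_ioctl_dataloc. */
struct dataloc_sketch {
	uint64_t file_offset;	/* rounded down to the start of the object */
	uint64_t object_offset;	/* offset of the byte within its object */
	uint64_t object_no;	/* index of the object within the file */
	uint64_t block_offset;	/* offset within the stripe unit */
	char object_name[64];	/* "<ino hex>.<object_no as 8 hex digits>" */
};

/* Assumes stripe_count == 1 (an assumption of this sketch only). */
static struct dataloc_sketch locate(uint64_t ino, uint64_t file_offset,
				    uint64_t object_size, uint64_t block_size)
{
	struct dataloc_sketch dl;

	assert(object_size > 0 && block_size > 0);
	memset(&dl, 0, sizeof(dl));
	dl.object_no = file_offset / object_size;
	dl.object_offset = file_offset % object_size;
	dl.file_offset = file_offset - dl.object_offset;
	dl.block_offset = dl.object_offset % block_size;
	snprintf(dl.object_name, sizeof(dl.object_name),
		 "%" PRIx64 ".%08" PRIx64, ino, dl.object_no);
	return dl;
}
```

For example, with 4 MiB objects and a 64 KiB block size, byte 8458608 of inode 0x10000000000 would land 70000 bytes into object 2.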
3021
fs/ceph/mds_client.c
Normal file
File diff suppressed because it is too large
335
fs/ceph/mds_client.h
Normal file
@ -0,0 +1,335 @@
|
||||
#ifndef _FS_CEPH_MDS_CLIENT_H
|
||||
#define _FS_CEPH_MDS_CLIENT_H
|
||||
|
||||
#include <linux/completion.h>
|
||||
#include <linux/kref.h>
|
||||
#include <linux/list.h>
|
||||
#include <linux/mutex.h>
|
||||
#include <linux/rbtree.h>
|
||||
#include <linux/spinlock.h>
|
||||
|
||||
#include "types.h"
|
||||
#include "messenger.h"
|
||||
#include "mdsmap.h"
|
||||
|
||||
/*
|
||||
* Some lock dependencies:
|
||||
*
|
||||
* session->s_mutex
|
||||
* mdsc->mutex
|
||||
*
|
||||
* mdsc->snap_rwsem
|
||||
*
|
||||
* inode->i_lock
|
||||
* mdsc->snap_flush_lock
|
||||
* mdsc->cap_delay_lock
|
||||
*
|
||||
*/
|
||||
|
||||
struct ceph_client;
|
||||
struct ceph_cap;
|
||||
|
||||
/*
|
||||
* parsed info about a single inode. pointers are into the encoded
|
||||
* on-wire structures within the mds reply message payload.
|
||||
*/
|
||||
struct ceph_mds_reply_info_in {
|
||||
struct ceph_mds_reply_inode *in;
|
||||
u32 symlink_len;
|
||||
char *symlink;
|
||||
u32 xattr_len;
|
||||
char *xattr_data;
|
||||
};
|
||||
|
||||
/*
|
||||
* parsed info about an mds reply, including information about the
|
||||
* target inode and/or its parent directory and dentry, and directory
|
||||
* contents (for readdir results).
|
||||
*/
|
||||
struct ceph_mds_reply_info_parsed {
|
||||
struct ceph_mds_reply_head *head;
|
||||
|
||||
struct ceph_mds_reply_info_in diri, targeti;
|
||||
struct ceph_mds_reply_dirfrag *dirfrag;
|
||||
char *dname;
|
||||
u32 dname_len;
|
||||
struct ceph_mds_reply_lease *dlease;
|
||||
|
||||
struct ceph_mds_reply_dirfrag *dir_dir;
|
||||
int dir_nr;
|
||||
char **dir_dname;
|
||||
u32 *dir_dname_len;
|
||||
struct ceph_mds_reply_lease **dir_dlease;
|
||||
struct ceph_mds_reply_info_in *dir_in;
|
||||
u8 dir_complete, dir_end;
|
||||
|
||||
/* encoded blob describing snapshot contexts for certain
|
||||
operations (e.g., open) */
|
||||
void *snapblob;
|
||||
int snapblob_len;
|
||||
};
|
||||
|
||||
|
||||
/*
|
||||
* cap releases are batched and sent to the MDS en masse.
|
||||
*/
|
||||
#define CEPH_CAPS_PER_RELEASE ((PAGE_CACHE_SIZE - \
|
||||
sizeof(struct ceph_mds_cap_release)) / \
|
||||
sizeof(struct ceph_mds_cap_item))
|
||||
|
||||
|
||||
/*
|
||||
* state associated with each MDS<->client session
|
||||
*/
|
||||
enum {
|
||||
CEPH_MDS_SESSION_NEW = 1,
|
||||
CEPH_MDS_SESSION_OPENING = 2,
|
||||
CEPH_MDS_SESSION_OPEN = 3,
|
||||
CEPH_MDS_SESSION_HUNG = 4,
|
||||
CEPH_MDS_SESSION_CLOSING = 5,
|
||||
CEPH_MDS_SESSION_RESTARTING = 6,
|
||||
CEPH_MDS_SESSION_RECONNECTING = 7,
|
||||
};
|
||||
|
||||
struct ceph_mds_session {
|
||||
struct ceph_mds_client *s_mdsc;
|
||||
int s_mds;
|
||||
int s_state;
|
||||
unsigned long s_ttl; /* time until mds kills us */
|
||||
u64 s_seq; /* incoming msg seq # */
|
||||
struct mutex s_mutex; /* serialize session messages */
|
||||
|
||||
struct ceph_connection s_con;
|
||||
|
||||
struct ceph_authorizer *s_authorizer;
|
||||
void *s_authorizer_buf, *s_authorizer_reply_buf;
|
||||
size_t s_authorizer_buf_len, s_authorizer_reply_buf_len;
|
||||
|
||||
/* protected by s_cap_lock */
|
||||
spinlock_t s_cap_lock;
|
||||
u32 s_cap_gen; /* inc each time we get mds stale msg */
|
||||
unsigned long s_cap_ttl; /* when session caps expire */
|
||||
struct list_head s_caps; /* all caps issued by this session */
|
||||
int s_nr_caps, s_trim_caps;
|
||||
int s_num_cap_releases;
|
||||
struct list_head s_cap_releases; /* waiting cap_release messages */
|
||||
struct list_head s_cap_releases_done; /* ready to send */
|
||||
struct ceph_cap *s_cap_iterator;
|
||||
|
||||
/* protected by mutex */
|
||||
struct list_head s_cap_flushing; /* inodes w/ flushing caps */
|
||||
struct list_head s_cap_snaps_flushing;
|
||||
unsigned long s_renew_requested; /* last time we sent a renew req */
|
||||
u64 s_renew_seq;
|
||||
|
||||
atomic_t s_ref;
|
||||
struct list_head s_waiting; /* waiting requests */
|
||||
struct list_head s_unsafe; /* unsafe requests */
|
||||
};
|
||||
|
||||
/*
|
||||
* modes of choosing which MDS to send a request to
|
||||
*/
|
||||
enum {
|
||||
USE_ANY_MDS,
|
||||
USE_RANDOM_MDS,
|
||||
USE_AUTH_MDS, /* prefer authoritative mds for this metadata item */
|
||||
};
|
||||
|
||||
struct ceph_mds_request;
|
||||
struct ceph_mds_client;
|
||||
|
||||
/*
|
||||
* request completion callback
|
||||
*/
|
||||
typedef void (*ceph_mds_request_callback_t) (struct ceph_mds_client *mdsc,
|
||||
struct ceph_mds_request *req);
|
||||
|
||||
/*
|
||||
* an in-flight mds request
|
||||
*/
|
||||
struct ceph_mds_request {
|
||||
u64 r_tid; /* transaction id */
|
||||
struct rb_node r_node;
|
||||
|
||||
int r_op; /* mds op code */
|
||||
int r_mds;
|
||||
|
||||
/* operation on what? */
|
||||
struct inode *r_inode; /* arg1 */
|
||||
struct dentry *r_dentry; /* arg1 */
|
||||
struct dentry *r_old_dentry; /* arg2: rename from or link from */
|
||||
char *r_path1, *r_path2;
|
||||
struct ceph_vino r_ino1, r_ino2;
|
||||
|
||||
struct inode *r_locked_dir; /* dir (if any) i_mutex locked by vfs */
|
||||
struct inode *r_target_inode; /* resulting inode */
|
||||
|
||||
union ceph_mds_request_args r_args;
|
||||
int r_fmode; /* file mode, if expecting cap */
|
||||
|
||||
/* for choosing which mds to send this request to */
|
||||
int r_direct_mode;
|
||||
u32 r_direct_hash; /* choose dir frag based on this dentry hash */
|
||||
bool r_direct_is_hash; /* true if r_direct_hash is valid */
|
||||
|
||||
/* data payload is used for xattr ops */
|
||||
struct page **r_pages;
|
||||
int r_num_pages;
|
||||
int r_data_len;
|
||||
|
||||
/* what caps shall we drop? */
|
||||
int r_inode_drop, r_inode_unless;
|
||||
int r_dentry_drop, r_dentry_unless;
|
||||
int r_old_dentry_drop, r_old_dentry_unless;
|
||||
struct inode *r_old_inode;
|
||||
int r_old_inode_drop, r_old_inode_unless;
|
||||
|
||||
struct ceph_msg *r_request; /* original request */
|
||||
struct ceph_msg *r_reply;
|
||||
struct ceph_mds_reply_info_parsed r_reply_info;
|
||||
int r_err;
|
||||
bool r_aborted;
|
||||
|
||||
unsigned long r_timeout; /* optional. jiffies */
|
||||
unsigned long r_started; /* start time to measure timeout against */
|
||||
unsigned long r_request_started; /* start time for mds request only,
|
||||
used to measure lease durations */
|
||||
|
||||
/* link unsafe requests to parent directory, for fsync */
|
||||
struct inode *r_unsafe_dir;
|
||||
struct list_head r_unsafe_dir_item;
|
||||
|
||||
struct ceph_mds_session *r_session;
|
||||
|
||||
int r_attempts; /* resend attempts */
|
||||
int r_num_fwd; /* number of forward attempts */
|
||||
int r_num_stale;
|
||||
int r_resend_mds; /* mds to resend to next, if any*/
|
||||
|
||||
struct kref r_kref;
|
||||
struct list_head r_wait;
|
||||
struct completion r_completion;
|
||||
struct completion r_safe_completion;
|
||||
ceph_mds_request_callback_t r_callback;
|
||||
struct list_head r_unsafe_item; /* per-session unsafe list item */
|
||||
bool r_got_unsafe, r_got_safe;
|
||||
|
||||
bool r_did_prepopulate;
|
||||
u32 r_readdir_offset;
|
||||
|
||||
struct ceph_cap_reservation r_caps_reservation;
|
||||
int r_num_caps;
|
||||
};
|
||||
|
||||
/*
|
||||
* mds client state
|
||||
*/
|
||||
struct ceph_mds_client {
|
||||
struct ceph_client *client;
|
||||
struct mutex mutex; /* all nested structures */
|
||||
|
||||
struct ceph_mdsmap *mdsmap;
|
||||
struct completion safe_umount_waiters, session_close_waiters;
|
||||
struct list_head waiting_for_map;
|
||||
|
||||
struct ceph_mds_session **sessions; /* NULL for mds if no session */
|
||||
int max_sessions; /* len of sessions array */
|
||||
int stopping; /* true if shutting down */
|
||||
|
||||
/*
|
||||
* snap_rwsem will cover cap linkage into snaprealms, and
|
||||
* realm snap contexts. (later, we can do per-realm snap
|
||||
* contexts locks..) the empty list contains realms with no
|
||||
* references (implying they contain no inodes with caps) that
|
||||
* should be destroyed.
|
||||
*/
|
||||
struct rw_semaphore snap_rwsem;
|
||||
struct rb_root snap_realms;
|
||||
struct list_head snap_empty;
|
||||
spinlock_t snap_empty_lock; /* protect snap_empty */
|
||||
|
||||
u64 last_tid; /* most recent mds request */
|
||||
struct rb_root request_tree; /* pending mds requests */
|
||||
struct delayed_work delayed_work; /* delayed work */
|
||||
unsigned long last_renew_caps; /* last time we renewed our caps */
|
||||
struct list_head cap_delay_list; /* caps with delayed release */
|
||||
spinlock_t cap_delay_lock; /* protects cap_delay_list */
|
||||
struct list_head snap_flush_list; /* cap_snaps ready to flush */
|
||||
spinlock_t snap_flush_lock;
|
||||
|
||||
u64 cap_flush_seq;
|
||||
struct list_head cap_dirty; /* inodes with dirty caps */
|
||||
int num_cap_flushing; /* # caps we are flushing */
|
||||
spinlock_t cap_dirty_lock; /* protects above items */
|
||||
wait_queue_head_t cap_flushing_wq;
|
||||
|
||||
#ifdef CONFIG_DEBUG_FS
|
||||
struct dentry *debugfs_file;
|
||||
#endif
|
||||
|
||||
spinlock_t dentry_lru_lock;
|
||||
struct list_head dentry_lru;
|
||||
int num_dentry;
|
||||
};
|
||||
|
||||
extern const char *ceph_mds_op_name(int op);
|
||||
|
||||
extern struct ceph_mds_session *
|
||||
__ceph_lookup_mds_session(struct ceph_mds_client *, int mds);
|
||||
|
||||
static inline struct ceph_mds_session *
|
||||
ceph_get_mds_session(struct ceph_mds_session *s)
|
||||
{
|
||||
atomic_inc(&s->s_ref);
|
||||
return s;
|
||||
}
|
||||
|
||||
extern void ceph_put_mds_session(struct ceph_mds_session *s);
|
||||
|
||||
extern int ceph_send_msg_mds(struct ceph_mds_client *mdsc,
|
||||
struct ceph_msg *msg, int mds);
|
||||
|
||||
extern int ceph_mdsc_init(struct ceph_mds_client *mdsc,
|
||||
struct ceph_client *client);
|
||||
extern void ceph_mdsc_close_sessions(struct ceph_mds_client *mdsc);
|
||||
extern void ceph_mdsc_stop(struct ceph_mds_client *mdsc);
|
||||
|
||||
extern void ceph_mdsc_sync(struct ceph_mds_client *mdsc);
|
||||
|
||||
extern void ceph_mdsc_lease_release(struct ceph_mds_client *mdsc,
|
||||
struct inode *inode,
|
||||
struct dentry *dn, int mask);
|
||||
|
||||
extern struct ceph_mds_request *
|
||||
ceph_mdsc_create_request(struct ceph_mds_client *mdsc, int op, int mode);
|
||||
extern void ceph_mdsc_submit_request(struct ceph_mds_client *mdsc,
|
||||
struct ceph_mds_request *req);
|
||||
extern int ceph_mdsc_do_request(struct ceph_mds_client *mdsc,
|
||||
struct inode *dir,
|
||||
struct ceph_mds_request *req);
|
||||
static inline void ceph_mdsc_get_request(struct ceph_mds_request *req)
|
||||
{
|
||||
kref_get(&req->r_kref);
|
||||
}
|
||||
extern void ceph_mdsc_release_request(struct kref *kref);
|
||||
static inline void ceph_mdsc_put_request(struct ceph_mds_request *req)
|
||||
{
|
||||
kref_put(&req->r_kref, ceph_mdsc_release_request);
|
||||
}
|
||||
|
||||
extern void ceph_mdsc_pre_umount(struct ceph_mds_client *mdsc);
|
||||
|
||||
extern char *ceph_mdsc_build_path(struct dentry *dentry, int *plen, u64 *base,
|
||||
int stop_on_nosnap);
|
||||
|
||||
extern void __ceph_mdsc_drop_dentry_lease(struct dentry *dentry);
|
||||
extern void ceph_mdsc_lease_send_msg(struct ceph_mds_session *session,
|
||||
struct inode *inode,
|
||||
struct dentry *dentry, char action,
|
||||
u32 seq);
|
||||
|
||||
extern void ceph_mdsc_handle_map(struct ceph_mds_client *mdsc,
|
||||
struct ceph_msg *msg);
|
||||
|
||||
#endif
|
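The header pairs ceph_mdsc_get_request with ceph_mdsc_put_request, where the final put invokes a release function. The following is a rough userspace sketch of that get/put discipline, not the kernel's kref implementation: `struct request_sketch` and its helpers are hypothetical, and the counter is a plain `int`, whereas kref uses atomic operations so that it is safe across CPUs.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reference-counted object standing in for a request. */
struct request_sketch {
	int refcount;		/* NOT atomic: single-threaded sketch only */
	int *released_flag;	/* set to 1 when the object is freed */
};

/* Create an object holding one reference, owned by the caller. */
static struct request_sketch *request_create(int *released_flag)
{
	struct request_sketch *req = malloc(sizeof(*req));

	if (!req)
		return NULL;
	req->refcount = 1;
	req->released_flag = released_flag;
	return req;
}

/* Take an extra reference, e.g. before handing req to another user. */
static void request_get(struct request_sketch *req)
{
	req->refcount++;
}

/* Drop a reference; the last put releases the object. */
static void request_put(struct request_sketch *req)
{
	assert(req->refcount > 0);
	if (--req->refcount == 0) {
		*req->released_flag = 1;
		free(req);
	}
}
```

The rule of thumb this illustrates is that every holder of a pointer owns exactly one reference and drops it exactly once.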
174
fs/ceph/mdsmap.c
Normal file
@ -0,0 +1,174 @@
|
||||
#include "ceph_debug.h"
|
||||
|
||||
#include <linux/bug.h>
|
||||
#include <linux/err.h>
|
||||
#include <linux/random.h>
|
||||
#include <linux/slab.h>
|
||||
#include <linux/types.h>
|
||||
|
||||
#include "mdsmap.h"
|
||||
#include "messenger.h"
|
||||
#include "decode.h"
|
||||
|
||||
#include "super.h"
|
||||
|
||||
|
||||
/*
 * choose a random mds that is "up" (i.e. has a state > 0), or -1.
 */
int ceph_mdsmap_get_random_mds(struct ceph_mdsmap *m)
{
	int n = 0;
	int i;
	u8 r;	/* unsigned, so r % n cannot go negative */

	/* count */
	for (i = 0; i < m->m_max_mds; i++)
		if (m->m_info[i].state > 0)
			n++;
	if (n == 0)
		return -1;

	/* pick the n-th mds that is up */
	get_random_bytes(&r, 1);
	n = r % n;
	for (i = 0; i < m->m_max_mds; i++) {
		if (m->m_info[i].state <= 0)
			continue;
		if (n-- == 0)
			break;
	}

	return i;
}
|
||||
|
||||
/*
|
||||
* Decode an MDS map
|
||||
*
|
||||
* Ignore any fields we don't care about (there are quite a few of
|
||||
* them).
|
||||
*/
|
||||
struct ceph_mdsmap *ceph_mdsmap_decode(void **p, void *end)
|
||||
{
|
||||
struct ceph_mdsmap *m;
|
||||
const void *start = *p;
|
||||
int i, j, n;
|
||||
int err = -EINVAL;
|
||||
u16 version;
|
||||
|
||||
m = kzalloc(sizeof(*m), GFP_NOFS);
|
||||
if (m == NULL)
|
||||
return ERR_PTR(-ENOMEM);
|
||||
|
||||
ceph_decode_16_safe(p, end, version, bad);
|
||||
|
||||
ceph_decode_need(p, end, 8*sizeof(u32) + sizeof(u64), bad);
|
||||
m->m_epoch = ceph_decode_32(p);
|
||||
m->m_client_epoch = ceph_decode_32(p);
|
||||
m->m_last_failure = ceph_decode_32(p);
|
||||
m->m_root = ceph_decode_32(p);
|
||||
m->m_session_timeout = ceph_decode_32(p);
|
||||
m->m_session_autoclose = ceph_decode_32(p);
|
||||
m->m_max_file_size = ceph_decode_64(p);
|
||||
m->m_max_mds = ceph_decode_32(p);
|
||||
|
||||
m->m_info = kcalloc(m->m_max_mds, sizeof(*m->m_info), GFP_NOFS);
|
||||
if (m->m_info == NULL)
|
||||
goto badmem;
|
||||
|
||||
/* pick out active nodes from mds_info (state > 0) */
|
||||
n = ceph_decode_32(p);
|
||||
for (i = 0; i < n; i++) {
|
||||
u64 global_id;
|
||||
u32 namelen;
|
||||
s32 mds, inc, state;
|
||||
u64 state_seq;
|
||||
u8 infoversion;
|
||||
struct ceph_entity_addr addr;
|
||||
u32 num_export_targets;
|
||||
void *pexport_targets = NULL;
|
||||
|
||||
ceph_decode_need(p, end, sizeof(u64)*2 + 1 + sizeof(u32), bad);
|
||||
global_id = ceph_decode_64(p);
|
||||
infoversion = ceph_decode_8(p);
|
||||
*p += sizeof(u64);
|
||||
namelen = ceph_decode_32(p); /* skip mds name */
|
||||
*p += namelen;
|
||||
|
||||
ceph_decode_need(p, end,
|
||||
4*sizeof(u32) + sizeof(u64) +
|
||||
sizeof(addr) + sizeof(struct ceph_timespec),
|
||||
bad);
|
||||
mds = ceph_decode_32(p);
|
||||
inc = ceph_decode_32(p);
|
||||
state = ceph_decode_32(p);
|
||||
state_seq = ceph_decode_64(p);
|
||||
ceph_decode_copy(p, &addr, sizeof(addr));
|
||||
ceph_decode_addr(&addr);
|
||||
*p += sizeof(struct ceph_timespec);
|
||||
*p += sizeof(u32);
|
||||
ceph_decode_32_safe(p, end, namelen, bad);
|
||||
*p += namelen;
|
||||
if (infoversion >= 2) {
|
||||
ceph_decode_32_safe(p, end, num_export_targets, bad);
|
||||
pexport_targets = *p;
|
||||
*p += num_export_targets * sizeof(u32);
|
||||
} else {
|
||||
num_export_targets = 0;
|
||||
}
|
||||
|
||||
dout("mdsmap_decode %d/%d %lld mds%d.%d %s %s\n",
|
||||
i+1, n, global_id, mds, inc, pr_addr(&addr.in_addr),
|
||||
ceph_mds_state_name(state));
|
||||
if (mds >= 0 && mds < m->m_max_mds && state > 0) {
|
||||
m->m_info[mds].global_id = global_id;
|
||||
m->m_info[mds].state = state;
|
||||
m->m_info[mds].addr = addr;
|
||||
m->m_info[mds].num_export_targets = num_export_targets;
|
||||
			if (num_export_targets) {
				m->m_info[mds].export_targets =
					kcalloc(num_export_targets, sizeof(u32),
						GFP_NOFS);
				/* don't write through a failed allocation */
				if (m->m_info[mds].export_targets == NULL)
					goto badmem;
				for (j = 0; j < num_export_targets; j++)
					m->m_info[mds].export_targets[j] =
					       ceph_decode_32(&pexport_targets);
			} else {
				m->m_info[mds].export_targets = NULL;
			}
|
||||
}
|
||||
}
|
||||
|
||||
/* pg_pools */
|
||||
ceph_decode_32_safe(p, end, n, bad);
|
||||
m->m_num_data_pg_pools = n;
|
||||
m->m_data_pg_pools = kcalloc(n, sizeof(u32), GFP_NOFS);
|
||||
if (!m->m_data_pg_pools)
|
||||
goto badmem;
|
||||
ceph_decode_need(p, end, sizeof(u32)*(n+1), bad);
|
||||
for (i = 0; i < n; i++)
|
||||
m->m_data_pg_pools[i] = ceph_decode_32(p);
|
||||
m->m_cas_pg_pool = ceph_decode_32(p);
|
||||
|
||||
/* ok, we don't care about the rest. */
|
||||
dout("mdsmap_decode success epoch %u\n", m->m_epoch);
|
||||
return m;
|
||||
|
||||
badmem:
|
||||
err = -ENOMEM;
|
||||
bad:
|
||||
pr_err("corrupt mdsmap\n");
|
||||
print_hex_dump(KERN_DEBUG, "mdsmap: ",
|
||||
DUMP_PREFIX_OFFSET, 16, 1,
|
||||
start, end - start, true);
|
||||
ceph_mdsmap_destroy(m);
|
||||
return ERR_PTR(err);
|
||||
}
|
||||
|
||||
void ceph_mdsmap_destroy(struct ceph_mdsmap *m)
|
||||
{
|
||||
int i;
|
||||
|
||||
for (i = 0; i < m->m_max_mds; i++)
|
||||
kfree(m->m_info[i].export_targets);
|
||||
kfree(m->m_info);
|
||||
kfree(m->m_data_pg_pools);
|
||||
kfree(m);
|
||||
}
|
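The decoder above follows a check-then-read pattern: ceph_decode_need() confirms that enough bytes remain, and only then does a run of unchecked ceph_decode_32()/ceph_decode_64() reads advance the cursor. The following is a rough userspace sketch of that pattern, not the kernel helpers themselves. It decodes a counted array of little-endian u32 values. All sketch_* names are hypothetical, and the explicit cap on the count is an assumption added for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins for the kernel's decode helpers. */

/* Return 1 if at least n bytes remain between *p and end, else 0. */
static int sketch_has_bytes(void **p, void *end, size_t n)
{
	char *cur = *p, *lim = end;

	if (cur > lim)
		return 0;
	return (size_t)(lim - cur) >= n;
}

/* Read a little-endian u32 and advance the cursor; caller checked bounds. */
static uint32_t sketch_decode_32(void **p)
{
	const unsigned char *b = *p;
	uint32_t v = (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
		     ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);

	*p = (char *)*p + 4;
	return v;
}

/*
 * Decode a u32 count followed by that many u32 entries.  Returns the
 * number of entries decoded, or -1 if the buffer is too short or the
 * count exceeds max (the capacity of out).
 */
static int sketch_decode_u32_array(void **p, void *end,
				   uint32_t *out, uint32_t max)
{
	uint32_t i, n;

	assert(p && *p && end && out);
	if (!sketch_has_bytes(p, end, sizeof(uint32_t)))
		return -1;
	n = sketch_decode_32(p);
	if (n > max)
		return -1;
	/* one bounds check covers the whole run of reads below */
	if (!sketch_has_bytes(p, end, (size_t)n * sizeof(uint32_t)))
		return -1;
	for (i = 0; i < n; i++)
		out[i] = sketch_decode_32(p);
	return (int)n;
}
```

Checking once per run keeps the hot path free of per-field branches, at the cost of having to keep the size expression in the check in sync with the reads that follow it.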
54
fs/ceph/mdsmap.h
Normal file
@ -0,0 +1,54 @@
|
||||
#ifndef _FS_CEPH_MDSMAP_H
|
||||
#define _FS_CEPH_MDSMAP_H
|
||||
|
||||
#include "types.h"
|
||||
|
||||
/*
|
||||
* mds map - describe servers in the mds cluster.
|
||||
*
|
||||
* we limit fields to those the client actually cares about
|
||||
*/
|
||||
struct ceph_mds_info {
|
||||
u64 global_id;
|
||||
struct ceph_entity_addr addr;
|
||||
s32 state;
|
||||
int num_export_targets;
|
||||
u32 *export_targets;
|
||||
};
|
||||
|
||||
struct ceph_mdsmap {
|
||||
u32 m_epoch, m_client_epoch, m_last_failure;
|
||||
u32 m_root;
|
||||
u32 m_session_timeout; /* seconds */
|
||||
u32 m_session_autoclose; /* seconds */
|
||||
u64 m_max_file_size;
|
||||
u32 m_max_mds; /* size of m_info array */
|
||||
struct ceph_mds_info *m_info;
|
||||
|
||||
/* which object pools file data can be stored in */
|
||||
int m_num_data_pg_pools;
|
||||
u32 *m_data_pg_pools;
|
||||
u32 m_cas_pg_pool;
|
||||
};
|
||||
|
||||
static inline struct ceph_entity_addr *
|
||||
ceph_mdsmap_get_addr(struct ceph_mdsmap *m, int w)
|
||||
{
|
||||
if (w >= m->m_max_mds)
|
||||
return NULL;
|
||||
return &m->m_info[w].addr;
|
||||
}
|
||||
|
||||
static inline int ceph_mdsmap_get_state(struct ceph_mdsmap *m, int w)
|
||||
{
|
||||
BUG_ON(w < 0);
|
||||
if (w >= m->m_max_mds)
|
||||
return CEPH_MDS_STATE_DNE;
|
||||
return m->m_info[w].state;
|
||||
}
|
||||
|
||||
extern int ceph_mdsmap_get_random_mds(struct ceph_mdsmap *m);
|
||||
extern struct ceph_mdsmap *ceph_mdsmap_decode(void **p, void *end);
|
||||
extern void ceph_mdsmap_destroy(struct ceph_mdsmap *m);
|
||||
|
||||
#endif
|
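The two inline accessors in mdsmap.h treat an out-of-range rank as "no such MDS" rather than indexing past the array. Because they compare an int against a u32, C's usual arithmetic conversions turn a negative rank into a large unsigned value, so it also fails the range test. The sketch below spells that check out explicitly in userspace C. The sketch_* types and the SKETCH_STATE_DNE value are hypothetical stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_STATE_DNE 0	/* hypothetical "does not exist" sentinel */

struct sketch_mds_info {
	int32_t state;
};

struct sketch_mdsmap {
	uint32_t max_mds;		/* number of entries in info[] */
	struct sketch_mds_info *info;
};

/* Return the state of rank w, or the sentinel if w is out of range. */
static int sketch_get_state(const struct sketch_mdsmap *m, int w)
{
	assert(m != NULL);
	if (w < 0 || (uint32_t)w >= m->max_mds)
		return SKETCH_STATE_DNE;
	return m->info[w].state;
}
```

Returning a sentinel lets callers treat "rank not in the map" and "rank known but down" through the same state comparison, without a separate error path.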
2240
fs/ceph/messenger.c
Normal file
File diff suppressed because it is too large
254
fs/ceph/messenger.h
Normal file
@ -0,0 +1,254 @@
|
||||
#ifndef __FS_CEPH_MESSENGER_H
|
||||
#define __FS_CEPH_MESSENGER_H
|
||||
|
||||
#include <linux/kref.h>
|
||||
#include <linux/mutex.h>
|
||||
#include <linux/net.h>
|
||||
#include <linux/radix-tree.h>
|
||||
#include <linux/uio.h>
|
||||
#include <linux/version.h>
|
||||
#include <linux/workqueue.h>
|
||||
|
||||
#include "types.h"
|
||||
#include "buffer.h"
|
||||
|
||||
struct ceph_msg;
|
||||
struct ceph_connection;
|
||||
|
||||
extern struct workqueue_struct *ceph_msgr_wq; /* receive work queue */
|
||||
|
||||
/*
|
||||
* Ceph defines these callbacks for handling connection events.
|
||||
*/
|
||||
struct ceph_connection_operations {
|
||||
struct ceph_connection *(*get)(struct ceph_connection *);
|
||||
void (*put)(struct ceph_connection *);
|
||||
|
||||
/* handle an incoming message. */
|
||||
void (*dispatch) (struct ceph_connection *con, struct ceph_msg *m);
|
||||
|
||||
/* authorize an outgoing connection */
|
||||
int (*get_authorizer) (struct ceph_connection *con,
|
||||
void **buf, int *len, int *proto,
|
||||
void **reply_buf, int *reply_len, int force_new);
|
||||
int (*verify_authorizer_reply) (struct ceph_connection *con, int len);
|
||||
int (*invalidate_authorizer)(struct ceph_connection *con);
|
||||
|
||||
/* protocol version mismatch */
|
||||
void (*bad_proto) (struct ceph_connection *con);
|
||||
|
||||
/* there was some error on the socket (disconnect, whatever) */
|
||||
void (*fault) (struct ceph_connection *con);
|
||||
|
||||
/* a remote host has terminated a message exchange session, and messages
|
||||
* we sent (or they tried to send us) may be lost. */
|
||||
void (*peer_reset) (struct ceph_connection *con);
|
||||
|
||||
struct ceph_msg * (*alloc_msg) (struct ceph_connection *con,
|
||||
struct ceph_msg_header *hdr,
|
||||
int *skip);
|
||||
};
|
||||
|
||||
extern const char *ceph_name_type_str(int t);
|
||||
|
||||
/* use format string %s%d */
|
||||
#define ENTITY_NAME(n) ceph_name_type_str((n).type), le64_to_cpu((n).num)
|
||||
|
||||
struct ceph_messenger {
|
||||
struct ceph_entity_inst inst; /* my name+address */
|
||||
struct ceph_entity_addr my_enc_addr;
|
||||
struct page *zero_page; /* used in certain error cases */
|
||||
|
||||
bool nocrc;
|
||||
|
||||
/*
|
||||
* the global_seq counts connections we (attempt to) initiate
|
||||
* in order to disambiguate certain connect race conditions.
|
||||
*/
|
||||
u32 global_seq;
|
||||
spinlock_t global_seq_lock;
|
||||
};
|
||||
|
||||
/*
|
||||
* a single message. it contains a header (src, dest, message type, etc.),
|
||||
* footer (crc values, mainly), a "front" message body, and possibly a
|
||||
* data payload (stored in some number of pages).
|
||||
*/
|
||||
struct ceph_msg {
|
||||
struct ceph_msg_header hdr; /* header */
|
||||
struct ceph_msg_footer footer; /* footer */
|
||||
struct kvec front; /* unaligned blobs of message */
|
||||
struct ceph_buffer *middle;
|
||||
struct page **pages; /* data payload. NOT OWNER. */
|
||||
unsigned nr_pages; /* size of page array */
|
||||
struct ceph_pagelist *pagelist; /* instead of pages */
|
||||
struct list_head list_head;
|
||||
struct kref kref;
|
||||
bool front_is_vmalloc;
|
||||
bool more_to_follow;
|
||||
int front_max;
|
||||
|
||||
struct ceph_msgpool *pool;
|
||||
};
|
||||
|
||||
struct ceph_msg_pos {
|
||||
int page, page_pos; /* which page; offset in page */
|
||||
int data_pos; /* offset in data payload */
|
||||
int did_page_crc; /* true if we've calculated crc for current page */
|
||||
};
|
||||
|
||||
/* ceph connection fault delay defaults, for exponential backoff */
|
||||
#define BASE_DELAY_INTERVAL (HZ/2)
|
||||
#define MAX_DELAY_INTERVAL (5 * 60 * HZ)
|
||||
|
||||
/*
|
||||
* ceph_connection state bit flags
|
||||
*
|
||||
* QUEUED and BUSY are used together to ensure that only a single
|
||||
* thread is currently opening, reading or writing data to the socket.
|
||||
*/
|
||||
#define LOSSYTX 0 /* we can close channel or drop messages on errors */
|
||||
#define CONNECTING 1
|
||||
#define NEGOTIATING 2
|
||||
#define KEEPALIVE_PENDING 3
|
||||
#define WRITE_PENDING 4 /* we have data ready to send */
|
||||
#define QUEUED 5 /* there is work queued on this connection */
|
||||
#define BUSY 6 /* work is being done */
|
||||
#define STANDBY 8 /* no outgoing messages, socket closed. we keep
|
||||
* the ceph_connection around to maintain shared
|
||||
* state with the peer. */
|
||||
#define CLOSED 10 /* we've closed the connection */
|
||||
#define SOCK_CLOSED 11 /* socket state changed to closed */
|
||||
#define OPENING 13 /* open connection w/ (possibly new) peer */
|
||||
#define DEAD 14 /* dead, about to kfree */
|
||||
|
||||
/*
|
||||
* A single connection with another host.
|
||||
*
|
||||
* We maintain a queue of outgoing messages, and some session state to
|
||||
* ensure that we can preserve the lossless, ordered delivery of
|
||||
* messages in the case of a TCP disconnect.
|
||||
*/
|
||||
struct ceph_connection {
|
||||
void *private;
|
||||
atomic_t nref;
|
||||
|
||||
const struct ceph_connection_operations *ops;
|
||||
|
||||
struct ceph_messenger *msgr;
|
||||
struct socket *sock;
|
||||
unsigned long state; /* connection state (see flags above) */
|
||||
const char *error_msg; /* error message, if any */
|
||||
|
||||
struct ceph_entity_addr peer_addr; /* peer address */
|
||||
struct ceph_entity_name peer_name; /* peer name */
|
||||
struct ceph_entity_addr peer_addr_for_me;
|
||||
u32 connect_seq; /* identify the most recent connection
|
||||
attempt for this connection, client */
|
||||
u32 peer_global_seq; /* peer's global seq for this connection */
|
||||
|
||||
int auth_retry; /* true if we need a newer authorizer */
|
||||
void *auth_reply_buf; /* where to put the authorizer reply */
|
||||
int auth_reply_buf_len;
|
||||
|
||||
struct mutex mutex;
|
||||
|
||||
/* out queue */
|
||||
struct list_head out_queue;
|
||||
struct list_head out_sent; /* sending or sent but unacked */
|
||||
u64 out_seq; /* last message queued for send */
|
||||
u64 out_seq_sent; /* last message sent */
|
||||
bool out_keepalive_pending;
|
||||
|
||||
u64 in_seq, in_seq_acked; /* last message received, acked */
|
||||
|
||||
/* connection negotiation temps */
|
||||
char in_banner[CEPH_BANNER_MAX_LEN];
|
||||
union {
|
||||
struct { /* outgoing connection */
|
||||
struct ceph_msg_connect out_connect;
|
||||
struct ceph_msg_connect_reply in_reply;
|
||||
};
|
||||
struct { /* incoming */
|
||||
struct ceph_msg_connect in_connect;
|
||||
struct ceph_msg_connect_reply out_reply;
|
||||
};
|
||||
};
|
||||
struct ceph_entity_addr actual_peer_addr;
|
||||
|
||||
/* message out temps */
|
||||
struct ceph_msg *out_msg; /* sending message (== tail of
|
||||
out_sent) */
|
||||
bool out_msg_done;
|
||||
struct ceph_msg_pos out_msg_pos;
|
||||
|
||||
struct kvec out_kvec[8], /* sending header/footer data */
|
||||
*out_kvec_cur;
|
||||
int out_kvec_left; /* kvec's left in out_kvec */
|
||||
int out_skip; /* skip this many bytes */
|
||||
int out_kvec_bytes; /* total bytes left */
|
||||
bool out_kvec_is_msg; /* kvec refers to out_msg */
|
||||
int out_more; /* there is more data after the kvecs */
|
||||
__le64 out_temp_ack; /* for writing an ack */
|
||||
|
||||
/* message in temps */
|
||||
struct ceph_msg_header in_hdr;
|
||||
struct ceph_msg *in_msg;
|
||||
struct ceph_msg_pos in_msg_pos;
|
||||
u32 in_front_crc, in_middle_crc, in_data_crc; /* calculated crc */
|
||||
|
||||
char in_tag; /* protocol control byte */
|
||||
int in_base_pos; /* bytes read */
|
||||
__le64 in_temp_ack; /* for reading an ack */
|
||||
|
||||
struct delayed_work work; /* send|recv work */
|
||||
unsigned long delay; /* current delay interval */
|
||||
};
|
||||
|
||||
|
||||
extern const char *pr_addr(const struct sockaddr_storage *ss);
|
||||
extern int ceph_parse_ips(const char *c, const char *end,
|
||||
struct ceph_entity_addr *addr,
|
||||
int max_count, int *count);
|
||||
|
||||
|
||||
extern int ceph_msgr_init(void);
|
||||
extern void ceph_msgr_exit(void);
|
||||
|
||||
extern struct ceph_messenger *ceph_messenger_create(
|
||||
struct ceph_entity_addr *myaddr);
|
||||
extern void ceph_messenger_destroy(struct ceph_messenger *);
|
||||
|
||||
extern void ceph_con_init(struct ceph_messenger *msgr,
|
||||
struct ceph_connection *con);
|
||||
extern void ceph_con_open(struct ceph_connection *con,
|
||||
struct ceph_entity_addr *addr);
|
||||
extern void ceph_con_close(struct ceph_connection *con);
|
||||
extern void ceph_con_send(struct ceph_connection *con, struct ceph_msg *msg);
|
||||
extern void ceph_con_revoke(struct ceph_connection *con, struct ceph_msg *msg);
|
||||
extern void ceph_con_revoke_message(struct ceph_connection *con,
|
||||
struct ceph_msg *msg);
|
||||
extern void ceph_con_keepalive(struct ceph_connection *con);
|
||||
extern struct ceph_connection *ceph_con_get(struct ceph_connection *con);
|
||||
extern void ceph_con_put(struct ceph_connection *con);
|
||||
|
||||
extern struct ceph_msg *ceph_msg_new(int type, int front_len,
|
||||
int page_len, int page_off,
|
||||
struct page **pages);
|
||||
extern void ceph_msg_kfree(struct ceph_msg *m);
|
||||
|
||||
|
||||
static inline struct ceph_msg *ceph_msg_get(struct ceph_msg *msg)
|
||||
{
|
||||
kref_get(&msg->kref);
|
||||
return msg;
|
||||
}
|
||||
extern void ceph_msg_last_put(struct kref *kref);
|
||||
static inline void ceph_msg_put(struct ceph_msg *msg)
|
||||
{
|
||||
kref_put(&msg->kref, ceph_msg_last_put);
|
||||
}
|
||||
|
||||
extern void ceph_msg_dump(struct ceph_msg *msg);
|
||||
|
||||
#endif
|
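messenger.h defines BASE_DELAY_INTERVAL and MAX_DELAY_INTERVAL as the bounds for exponential backoff after a connection fault, and struct ceph_connection carries the current interval in its delay field. The code that applies them lives in messenger.c, whose diff is suppressed here, so the sketch below only illustrates one plausible policy: start at the base interval, double on each further fault, and clamp at the maximum. The SKETCH_* constants, the fixed tick rate, and sketch_next_delay() are all assumptions made for the example.

```c
#include <assert.h>

#define SKETCH_HZ		100UL	/* assumed tick rate; the kernel's HZ is config-dependent */
#define SKETCH_BASE_DELAY	(SKETCH_HZ / 2)
#define SKETCH_MAX_DELAY	(5 * 60 * SKETCH_HZ)

/*
 * Given the delay used for the previous fault (0 if none yet), return
 * the delay to use for the next reconnect attempt.
 */
static unsigned long sketch_next_delay(unsigned long delay)
{
	/* a zero base interval could never grow by doubling */
	assert(SKETCH_BASE_DELAY > 0);

	if (delay == 0)
		return SKETCH_BASE_DELAY;
	if (delay < SKETCH_MAX_DELAY) {
		delay *= 2;
		if (delay > SKETCH_MAX_DELAY)
			delay = SKETCH_MAX_DELAY;
	}
	return delay;
}
```

With these values the sequence runs from half a second up to a five-minute ceiling, and a caller would reset the stored delay to 0 once a connection succeeds.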
834
fs/ceph/mon_client.c
Normal file
@ -0,0 +1,834 @@
|
||||
#include "ceph_debug.h"
|
||||
|
||||
#include <linux/types.h>
|
||||
#include <linux/random.h>
|
||||
#include <linux/sched.h>
|
||||
|
||||
#include "mon_client.h"
|
||||
#include "super.h"
|
||||
#include "auth.h"
|
||||
#include "decode.h"
|
||||
|
||||
/*
|
||||
* Interact with Ceph monitor cluster. Handle requests for new map
|
||||
* versions, and periodically resend as needed. Also implement
|
||||
* statfs() and umount().
|
||||
*
|
||||
* A small cluster of Ceph "monitors" is responsible for managing critical
|
||||
* cluster configuration and state information. An odd number (e.g., 3, 5)
|
||||
* of cmon daemons use a modified version of the Paxos part-time parliament
|
||||
* algorithm to manage the MDS map (mds cluster membership), OSD map, and
|
||||
* list of clients who have mounted the file system.
|
||||
*
|
||||
* We maintain an open, active session with a monitor at all times in order to
|
||||
* receive timely MDSMap updates. We periodically send a keepalive byte on the
|
||||
* TCP socket to ensure we detect a failure. If the connection does break, we
|
||||
* randomly hunt for a new monitor. Once the connection is reestablished, we
|
||||
* resend any outstanding requests.
|
||||
*/
|
||||
|
||||
static const struct ceph_connection_operations mon_con_ops;
|
||||
|
||||
static int __validate_auth(struct ceph_mon_client *monc);
|
||||
|
||||
/*
|
||||
* Decode a monmap blob (e.g., during mount).
|
||||
*/
|
||||
struct ceph_monmap *ceph_monmap_decode(void *p, void *end)
|
||||
{
|
||||
struct ceph_monmap *m = NULL;
|
||||
int i, err = -EINVAL;
|
||||
struct ceph_fsid fsid;
|
||||
u32 epoch, num_mon;
|
||||
u16 version;
|
||||
u32 len;
|
||||
|
||||
ceph_decode_32_safe(&p, end, len, bad);
|
||||
ceph_decode_need(&p, end, len, bad);
|
||||
|
||||
dout("monmap_decode %p %p len %d\n", p, end, (int)(end-p));
|
||||
|
||||
ceph_decode_16_safe(&p, end, version, bad);
|
||||
|
||||
ceph_decode_need(&p, end, sizeof(fsid) + 2*sizeof(u32), bad);
|
||||
ceph_decode_copy(&p, &fsid, sizeof(fsid));
|
||||
epoch = ceph_decode_32(&p);
|
||||
|
||||
num_mon = ceph_decode_32(&p);
|
||||
ceph_decode_need(&p, end, num_mon*sizeof(m->mon_inst[0]), bad);
|
||||
|
||||
if (num_mon >= CEPH_MAX_MON)
|
||||
goto bad;
|
||||
m = kmalloc(sizeof(*m) + sizeof(m->mon_inst[0])*num_mon, GFP_NOFS);
|
||||
if (m == NULL)
|
||||
return ERR_PTR(-ENOMEM);
|
||||
m->fsid = fsid;
|
||||
m->epoch = epoch;
|
||||
m->num_mon = num_mon;
|
||||
ceph_decode_copy(&p, m->mon_inst, num_mon*sizeof(m->mon_inst[0]));
|
||||
for (i = 0; i < num_mon; i++)
|
||||
ceph_decode_addr(&m->mon_inst[i].addr);
|
||||
|
||||
dout("monmap_decode epoch %d, num_mon %d\n", m->epoch,
|
||||
m->num_mon);
|
||||
for (i = 0; i < m->num_mon; i++)
|
||||
dout("monmap_decode mon%d is %s\n", i,
|
||||
pr_addr(&m->mon_inst[i].addr.in_addr));
|
||||
return m;
|
||||
|
||||
bad:
|
||||
dout("monmap_decode failed with %d\n", err);
|
||||
kfree(m);
|
||||
return ERR_PTR(err);
|
||||
}
|
||||
|
||||
/*
|
||||
* return true if *addr is included in the monmap.
|
||||
*/
|
||||
int ceph_monmap_contains(struct ceph_monmap *m, struct ceph_entity_addr *addr)
|
||||
{
|
||||
int i;
|
||||
|
||||
for (i = 0; i < m->num_mon; i++)
|
||||
if (memcmp(addr, &m->mon_inst[i].addr, sizeof(*addr)) == 0)
|
||||
return 1;
|
||||
return 0;
|
||||
}
|
||||
|
||||
/*
|
||||
* Send an auth request.
|
||||
*/
|
||||
static void __send_prepared_auth_request(struct ceph_mon_client *monc, int len)
|
||||
{
|
||||
monc->pending_auth = 1;
|
||||
monc->m_auth->front.iov_len = len;
|
||||
monc->m_auth->hdr.front_len = cpu_to_le32(len);
|
||||
ceph_msg_get(monc->m_auth); /* keep our ref */
|
||||
ceph_con_send(monc->con, monc->m_auth);
|
||||
}
|
||||
|
||||
/*
|
||||
* Close monitor session, if any.
|
||||
*/
|
||||
static void __close_session(struct ceph_mon_client *monc)
|
||||
{
|
||||
if (monc->con) {
|
||||
dout("__close_session closing mon%d\n", monc->cur_mon);
|
||||
ceph_con_revoke(monc->con, monc->m_auth);
|
||||
ceph_con_close(monc->con);
|
||||
monc->cur_mon = -1;
|
||||
monc->pending_auth = 0;
|
||||
ceph_auth_reset(monc->auth);
|
||||
}
|
||||
}
|
||||
|
||||
/*
|
||||
* Open a session with a (new) monitor.
|
||||
*/
|
||||
static int __open_session(struct ceph_mon_client *monc)
|
||||
{
|
||||
unsigned char r;
|
||||
int ret;
|
||||
|
||||
if (monc->cur_mon < 0) {
|
||||
get_random_bytes(&r, 1);
|
||||
monc->cur_mon = r % monc->monmap->num_mon;
|
||||
dout("open_session num=%d r=%d -> mon%d\n",
|
||||
monc->monmap->num_mon, r, monc->cur_mon);
|
||||
monc->sub_sent = 0;
|
||||
monc->sub_renew_after = jiffies; /* i.e., expired */
|
||||
monc->want_next_osdmap = !!monc->want_next_osdmap;
|
||||
|
||||
dout("open_session mon%d opening\n", monc->cur_mon);
|
||||
monc->con->peer_name.type = CEPH_ENTITY_TYPE_MON;
|
||||
monc->con->peer_name.num = cpu_to_le64(monc->cur_mon);
|
||||
ceph_con_open(monc->con,
|
||||
&monc->monmap->mon_inst[monc->cur_mon].addr);
|
||||
|
||||
/* initiate authentication handshake */
|
||||
ret = ceph_auth_build_hello(monc->auth,
|
||||
monc->m_auth->front.iov_base,
|
||||
monc->m_auth->front_max);
|
||||
__send_prepared_auth_request(monc, ret);
|
||||
} else {
|
||||
dout("open_session mon%d already open\n", monc->cur_mon);
|
||||
}
|
||||
return 0;
|
||||
}
|
||||
|
||||
static bool __sub_expired(struct ceph_mon_client *monc)
|
||||
{
|
||||
return time_after_eq(jiffies, monc->sub_renew_after);
|
||||
}
|
||||
|
||||
/*
|
||||
* Reschedule delayed work timer.
|
||||
*/
|
||||
static void __schedule_delayed(struct ceph_mon_client *monc)
|
||||
{
|
||||
unsigned delay;
|
||||
|
||||
if (monc->cur_mon < 0 || __sub_expired(monc))
|
||||
delay = 10 * HZ;
|
||||
else
|
||||
delay = 20 * HZ;
|
||||
dout("__schedule_delayed after %u\n", delay);
|
||||
schedule_delayed_work(&monc->delayed_work, delay);
|
||||
}
|
||||
|
||||
/*
|
||||
* Send subscribe request for mdsmap and/or osdmap.
|
||||
*/
|
||||
static void __send_subscribe(struct ceph_mon_client *monc)
|
||||
{
|
||||
dout("__send_subscribe sub_sent=%u exp=%u want_osd=%d\n",
|
||||
(unsigned)monc->sub_sent, __sub_expired(monc),
|
||||
monc->want_next_osdmap);
|
||||
if ((__sub_expired(monc) && !monc->sub_sent) ||
|
||||
monc->want_next_osdmap == 1) {
|
||||
struct ceph_msg *msg;
|
||||
struct ceph_mon_subscribe_item *i;
|
||||
void *p, *end;
|
||||
|
||||
msg = ceph_msg_new(CEPH_MSG_MON_SUBSCRIBE, 96, 0, 0, NULL);
|
||||
if (IS_ERR(msg))
|
||||
return;
|
||||
|
||||
p = msg->front.iov_base;
|
||||
end = p + msg->front.iov_len;
|
||||
|
||||
dout("__send_subscribe to 'mdsmap' %u+\n",
|
||||
(unsigned)monc->have_mdsmap);
|
||||
if (monc->want_next_osdmap) {
|
||||
dout("__send_subscribe to 'osdmap' %u\n",
|
||||
(unsigned)monc->have_osdmap);
|
||||
ceph_encode_32(&p, 3);
|
||||
ceph_encode_string(&p, end, "osdmap", 6);
|
||||
i = p;
|
||||
i->have = cpu_to_le64(monc->have_osdmap);
|
||||
i->onetime = 1;
|
||||
p += sizeof(*i);
|
||||
monc->want_next_osdmap = 2; /* requested */
|
||||
} else {
|
||||
ceph_encode_32(&p, 2);
|
||||
}
|
||||
ceph_encode_string(&p, end, "mdsmap", 6);
|
||||
i = p;
|
||||
i->have = cpu_to_le64(monc->have_mdsmap);
|
||||
i->onetime = 0;
|
||||
p += sizeof(*i);
|
||||
ceph_encode_string(&p, end, "monmap", 6);
|
||||
i = p;
|
||||
i->have = 0;
|
||||
i->onetime = 0;
|
||||
p += sizeof(*i);
|
||||
|
||||
msg->front.iov_len = p - msg->front.iov_base;
|
||||
msg->hdr.front_len = cpu_to_le32(msg->front.iov_len);
|
||||
ceph_con_send(monc->con, msg);
|
||||
|
||||
monc->sub_sent = jiffies | 1; /* never 0 */
|
||||
}
|
||||
}
|
||||
|
||||
static void handle_subscribe_ack(struct ceph_mon_client *monc,
|
||||
struct ceph_msg *msg)
|
||||
{
|
||||
unsigned seconds;
|
||||
struct ceph_mon_subscribe_ack *h = msg->front.iov_base;
|
||||
|
||||
if (msg->front.iov_len < sizeof(*h))
|
||||
goto bad;
|
||||
seconds = le32_to_cpu(h->duration);
|
||||
|
||||
mutex_lock(&monc->mutex);
|
||||
if (monc->hunting) {
|
||||
pr_info("mon%d %s session established\n",
|
||||
monc->cur_mon, pr_addr(&monc->con->peer_addr.in_addr));
|
||||
monc->hunting = false;
|
||||
}
|
||||
dout("handle_subscribe_ack after %d seconds\n", seconds);
|
||||
monc->sub_renew_after = monc->sub_sent + (seconds >> 1)*HZ - 1;
|
||||
monc->sub_sent = 0;
|
||||
mutex_unlock(&monc->mutex);
|
||||
return;
|
||||
bad:
|
||||
pr_err("got corrupt subscribe-ack msg\n");
|
||||
ceph_msg_dump(msg);
|
||||
}
|
||||
|
||||
/*
|
||||
* Keep track of which maps we have
|
||||
*/
|
||||
int ceph_monc_got_mdsmap(struct ceph_mon_client *monc, u32 got)
|
||||
{
|
||||
mutex_lock(&monc->mutex);
|
||||
monc->have_mdsmap = got;
|
||||
mutex_unlock(&monc->mutex);
|
||||
return 0;
|
||||
}
|
||||
|
||||
int ceph_monc_got_osdmap(struct ceph_mon_client *monc, u32 got)
|
||||
{
|
||||
mutex_lock(&monc->mutex);
|
||||
monc->have_osdmap = got;
|
||||
monc->want_next_osdmap = 0;
|
||||
mutex_unlock(&monc->mutex);
|
||||
return 0;
|
||||
}
|
||||
|
||||
/*
|
||||
* Register interest in the next osdmap
|
||||
*/
|
||||
void ceph_monc_request_next_osdmap(struct ceph_mon_client *monc)
|
||||
{
|
||||
dout("request_next_osdmap have %u\n", monc->have_osdmap);
|
||||
mutex_lock(&monc->mutex);
|
||||
if (!monc->want_next_osdmap)
|
||||
monc->want_next_osdmap = 1;
|
||||
if (monc->want_next_osdmap < 2)
|
||||
__send_subscribe(monc);
|
||||
mutex_unlock(&monc->mutex);
|
||||
}
|
||||
|
||||
/*
|
||||
* Open a session with a monitor, allocating the connection on first use.
|
||||
*/
|
||||
int ceph_monc_open_session(struct ceph_mon_client *monc)
|
||||
{
|
||||
if (!monc->con) {
|
||||
monc->con = kmalloc(sizeof(*monc->con), GFP_KERNEL);
|
||||
if (!monc->con)
|
||||
return -ENOMEM;
|
||||
ceph_con_init(monc->client->msgr, monc->con);
|
||||
monc->con->private = monc;
|
||||
monc->con->ops = &mon_con_ops;
|
||||
}
|
||||
|
||||
mutex_lock(&monc->mutex);
|
||||
__open_session(monc);
|
||||
__schedule_delayed(monc);
|
||||
mutex_unlock(&monc->mutex);
|
||||
return 0;
|
||||
}
|
||||
|
||||
/*
|
||||
* Handle a monmap message: decode the new map, check its fsid against
|
||||
* ours, and install it in place of the old one.
|
||||
*/
|
||||
static void ceph_monc_handle_map(struct ceph_mon_client *monc,
|
||||
struct ceph_msg *msg)
|
||||
{
|
||||
struct ceph_client *client = monc->client;
|
||||
struct ceph_monmap *monmap = NULL, *old = monc->monmap;
|
||||
void *p, *end;
|
||||
|
||||
mutex_lock(&monc->mutex);
|
||||
|
||||
dout("handle_monmap\n");
|
||||
p = msg->front.iov_base;
|
||||
end = p + msg->front.iov_len;
|
||||
|
||||
monmap = ceph_monmap_decode(p, end);
|
||||
if (IS_ERR(monmap)) {
|
||||
pr_err("problem decoding monmap, %d\n",
|
||||
(int)PTR_ERR(monmap));
|
||||
goto out;
|
||||
}
|
||||
|
||||
if (ceph_check_fsid(monc->client, &monmap->fsid) < 0) {
|
||||
kfree(monmap);
|
||||
goto out;
|
||||
}
|
||||
|
||||
client->monc.monmap = monmap;
|
||||
kfree(old);
|
||||
|
||||
out:
|
||||
mutex_unlock(&monc->mutex);
|
||||
wake_up(&client->auth_wq);
|
||||
}
|
||||
|
||||
/*
|
||||
* statfs
|
||||
*/
|
||||
static struct ceph_mon_statfs_request *__lookup_statfs(
|
||||
struct ceph_mon_client *monc, u64 tid)
|
||||
{
|
||||
struct ceph_mon_statfs_request *req;
|
||||
struct rb_node *n = monc->statfs_request_tree.rb_node;
|
||||
|
||||
while (n) {
|
||||
req = rb_entry(n, struct ceph_mon_statfs_request, node);
|
||||
if (tid < req->tid)
|
||||
n = n->rb_left;
|
||||
else if (tid > req->tid)
|
||||
n = n->rb_right;
|
||||
else
|
||||
return req;
|
||||
}
|
||||
return NULL;
|
||||
}
|
||||
|
||||
static void __insert_statfs(struct ceph_mon_client *monc,
|
||||
struct ceph_mon_statfs_request *new)
|
||||
{
|
||||
struct rb_node **p = &monc->statfs_request_tree.rb_node;
|
||||
struct rb_node *parent = NULL;
|
||||
struct ceph_mon_statfs_request *req = NULL;
|
||||
|
||||
while (*p) {
|
||||
parent = *p;
|
||||
req = rb_entry(parent, struct ceph_mon_statfs_request, node);
|
||||
if (new->tid < req->tid)
|
||||
p = &(*p)->rb_left;
|
||||
else if (new->tid > req->tid)
|
||||
p = &(*p)->rb_right;
|
||||
else
|
||||
BUG();
|
||||
}
|
||||
|
||||
rb_link_node(&new->node, parent, p);
|
||||
rb_insert_color(&new->node, &monc->statfs_request_tree);
|
||||
}
|
||||
|
||||
static void handle_statfs_reply(struct ceph_mon_client *monc,
|
||||
struct ceph_msg *msg)
|
||||
{
|
||||
struct ceph_mon_statfs_request *req;
|
||||
struct ceph_mon_statfs_reply *reply = msg->front.iov_base;
|
||||
u64 tid;
|
||||
|
||||
if (msg->front.iov_len != sizeof(*reply))
|
||||
goto bad;
|
||||
tid = le64_to_cpu(msg->hdr.tid);
|
||||
dout("handle_statfs_reply %p tid %llu\n", msg, tid);
|
||||
|
||||
mutex_lock(&monc->mutex);
|
||||
req = __lookup_statfs(monc, tid);
|
||||
if (req) {
|
||||
*req->buf = reply->st;
|
||||
req->result = 0;
|
||||
}
|
||||
mutex_unlock(&monc->mutex);
|
||||
if (req)
|
||||
complete(&req->completion);
|
||||
return;
|
||||
|
||||
bad:
|
||||
pr_err("corrupt statfs reply, no tid\n");
|
||||
ceph_msg_dump(msg);
|
||||
}
|
||||
|
||||
/*
|
||||
* (re)send a statfs request
|
||||
*/
|
||||
static int send_statfs(struct ceph_mon_client *monc,
|
||||
struct ceph_mon_statfs_request *req)
|
||||
{
|
||||
struct ceph_msg *msg;
|
||||
struct ceph_mon_statfs *h;
|
||||
|
||||
dout("send_statfs tid %llu\n", req->tid);
|
||||
msg = ceph_msg_new(CEPH_MSG_STATFS, sizeof(*h), 0, 0, NULL);
|
||||
if (IS_ERR(msg))
|
||||
return PTR_ERR(msg);
|
||||
req->request = msg;
|
||||
msg->hdr.tid = cpu_to_le64(req->tid);
|
||||
h = msg->front.iov_base;
|
||||
h->monhdr.have_version = 0;
|
||||
h->monhdr.session_mon = cpu_to_le16(-1);
|
||||
h->monhdr.session_mon_tid = 0;
|
||||
h->fsid = monc->monmap->fsid;
|
||||
ceph_con_send(monc->con, msg);
|
||||
return 0;
|
||||
}
|
||||
|
||||
/*
|
||||
* Do a synchronous statfs().
|
||||
*/
|
||||
int ceph_monc_do_statfs(struct ceph_mon_client *monc, struct ceph_statfs *buf)
|
||||
{
|
||||
struct ceph_mon_statfs_request req;
|
||||
int err;
|
||||
|
||||
req.buf = buf;
|
||||
init_completion(&req.completion);
|
||||
|
||||
/* allocate memory for reply */
|
||||
err = ceph_msgpool_resv(&monc->msgpool_statfs_reply, 1);
|
||||
if (err)
|
||||
return err;
|
||||
|
||||
/* register request */
|
||||
mutex_lock(&monc->mutex);
|
||||
req.tid = ++monc->last_tid;
|
||||
req.last_attempt = jiffies;
|
||||
req.delay = BASE_DELAY_INTERVAL;
|
||||
__insert_statfs(monc, &req);
|
||||
monc->num_statfs_requests++;
|
||||
mutex_unlock(&monc->mutex);
|
||||
|
||||
/* send request and wait */
|
||||
err = send_statfs(monc, &req);
|
||||
if (!err)
|
||||
err = wait_for_completion_interruptible(&req.completion);
|
||||
|
||||
mutex_lock(&monc->mutex);
|
||||
rb_erase(&req.node, &monc->statfs_request_tree);
|
||||
monc->num_statfs_requests--;
|
||||
ceph_msgpool_resv(&monc->msgpool_statfs_reply, -1);
|
||||
mutex_unlock(&monc->mutex);
|
||||
|
||||
if (!err)
|
||||
err = req.result;
|
||||
return err;
|
||||
}
|
||||
|
||||
/*
|
||||
* Resend pending statfs requests.
|
||||
*/
|
||||
static void __resend_statfs(struct ceph_mon_client *monc)
|
||||
{
|
||||
struct ceph_mon_statfs_request *req;
|
||||
struct rb_node *p;
|
||||
|
||||
for (p = rb_first(&monc->statfs_request_tree); p; p = rb_next(p)) {
|
||||
req = rb_entry(p, struct ceph_mon_statfs_request, node);
|
||||
send_statfs(monc, req);
|
||||
}
|
||||
}
|
||||
|
||||
/*
|
||||
* Delayed work. If we haven't mounted yet, retry. Otherwise,
|
||||
* renew/retry subscription as needed (in case it is timing out, or we
|
||||
* got an ENOMEM). And keep the monitor connection alive.
|
||||
*/
|
||||
static void delayed_work(struct work_struct *work)
|
||||
{
|
||||
struct ceph_mon_client *monc =
|
||||
container_of(work, struct ceph_mon_client, delayed_work.work);
|
||||
|
||||
dout("monc delayed_work\n");
|
||||
mutex_lock(&monc->mutex);
|
||||
if (monc->hunting) {
|
||||
__close_session(monc);
|
||||
__open_session(monc); /* continue hunting */
|
||||
} else {
|
||||
ceph_con_keepalive(monc->con);
|
||||
|
||||
__validate_auth(monc);
|
||||
|
||||
if (monc->auth->ops->is_authenticated(monc->auth))
|
||||
__send_subscribe(monc);
|
||||
}
|
||||
__schedule_delayed(monc);
|
||||
mutex_unlock(&monc->mutex);
|
||||
}
|
||||
|
||||
/*
|
||||
* On startup, we build a temporary monmap populated with the IPs
|
||||
* provided by mount(2).
|
||||
*/
|
||||
static int build_initial_monmap(struct ceph_mon_client *monc)
|
||||
{
|
||||
struct ceph_mount_args *args = monc->client->mount_args;
|
||||
struct ceph_entity_addr *mon_addr = args->mon_addr;
|
||||
int num_mon = args->num_mon;
|
||||
int i;
|
||||
|
||||
/* build initial monmap */
|
||||
monc->monmap = kzalloc(sizeof(*monc->monmap) +
|
||||
num_mon*sizeof(monc->monmap->mon_inst[0]),
|
||||
GFP_KERNEL);
|
||||
if (!monc->monmap)
|
||||
return -ENOMEM;
|
||||
for (i = 0; i < num_mon; i++) {
|
||||
monc->monmap->mon_inst[i].addr = mon_addr[i];
|
||||
monc->monmap->mon_inst[i].addr.nonce = 0;
|
||||
monc->monmap->mon_inst[i].name.type =
|
||||
CEPH_ENTITY_TYPE_MON;
|
||||
monc->monmap->mon_inst[i].name.num = cpu_to_le64(i);
|
||||
}
|
||||
monc->monmap->num_mon = num_mon;
|
||||
monc->have_fsid = false;
|
||||
|
||||
/* release addr memory */
|
||||
kfree(args->mon_addr);
|
||||
args->mon_addr = NULL;
|
||||
args->num_mon = 0;
|
||||
return 0;
|
||||
}
|
||||
|
||||
int ceph_monc_init(struct ceph_mon_client *monc, struct ceph_client *cl)
|
||||
{
|
||||
int err = 0;
|
||||
|
||||
dout("init\n");
|
||||
memset(monc, 0, sizeof(*monc));
|
||||
monc->client = cl;
|
||||
monc->monmap = NULL;
|
||||
mutex_init(&monc->mutex);
|
||||
|
||||
err = build_initial_monmap(monc);
|
||||
if (err)
|
||||
goto out;
|
||||
|
||||
monc->con = NULL;
|
||||
|
||||
/* authentication */
|
||||
monc->auth = ceph_auth_init(cl->mount_args->name,
|
||||
cl->mount_args->secret);
|
||||
if (IS_ERR(monc->auth)) {
|
||||
err = PTR_ERR(monc->auth);
|
||||
|
||||
goto out_monmap;
|
||||
}
|
||||
monc->auth->want_keys =
|
||||
CEPH_ENTITY_TYPE_AUTH | CEPH_ENTITY_TYPE_MON |
|
||||
CEPH_ENTITY_TYPE_OSD | CEPH_ENTITY_TYPE_MDS;
|
||||
|
||||
/* msg pools */
|
||||
err = ceph_msgpool_init(&monc->msgpool_subscribe_ack,
|
||||
sizeof(struct ceph_mon_subscribe_ack), 1, false);
|
||||
if (err < 0)
|
||||
goto out_monmap;
|
||||
err = ceph_msgpool_init(&monc->msgpool_statfs_reply,
|
||||
sizeof(struct ceph_mon_statfs_reply), 0, false);
|
||||
if (err < 0)
|
||||
goto out_pool1;
|
||||
err = ceph_msgpool_init(&monc->msgpool_auth_reply, 4096, 1, false);
|
||||
if (err < 0)
|
||||
goto out_pool2;
|
||||
|
||||
monc->m_auth = ceph_msg_new(CEPH_MSG_AUTH, 4096, 0, 0, NULL);
|
||||
monc->pending_auth = 0;
|
||||
if (IS_ERR(monc->m_auth)) {
|
||||
err = PTR_ERR(monc->m_auth);
|
||||
monc->m_auth = NULL;
|
||||
goto out_pool3;
|
||||
}
|
||||
|
||||
monc->cur_mon = -1;
|
||||
monc->hunting = true;
|
||||
monc->sub_renew_after = jiffies;
|
||||
monc->sub_sent = 0;
|
||||
|
||||
INIT_DELAYED_WORK(&monc->delayed_work, delayed_work);
|
||||
monc->statfs_request_tree = RB_ROOT;
|
||||
monc->num_statfs_requests = 0;
|
||||
monc->last_tid = 0;
|
||||
|
||||
monc->have_mdsmap = 0;
|
||||
monc->have_osdmap = 0;
|
||||
monc->want_next_osdmap = 1;
|
||||
return 0;
|
||||
|
||||
out_pool3:
|
||||
ceph_msgpool_destroy(&monc->msgpool_auth_reply);
|
||||
out_pool2:
|
||||
ceph_msgpool_destroy(&monc->msgpool_statfs_reply);
|
||||
out_pool1:
|
||||
ceph_msgpool_destroy(&monc->msgpool_subscribe_ack);
|
||||
out_monmap:
|
||||
kfree(monc->monmap);
|
||||
out:
|
||||
return err;
|
||||
}
|
||||
|
||||
void ceph_monc_stop(struct ceph_mon_client *monc)
|
||||
{
|
||||
dout("stop\n");
|
||||
cancel_delayed_work_sync(&monc->delayed_work);
|
||||
|
||||
mutex_lock(&monc->mutex);
|
||||
__close_session(monc);
|
||||
if (monc->con) {
|
||||
monc->con->private = NULL;
|
||||
monc->con->ops->put(monc->con);
|
||||
monc->con = NULL;
|
||||
}
|
||||
mutex_unlock(&monc->mutex);
|
||||
|
||||
ceph_auth_destroy(monc->auth);
|
||||
|
||||
ceph_msg_put(monc->m_auth);
|
||||
ceph_msgpool_destroy(&monc->msgpool_subscribe_ack);
|
||||
ceph_msgpool_destroy(&monc->msgpool_statfs_reply);
|
||||
ceph_msgpool_destroy(&monc->msgpool_auth_reply);
|
||||
|
||||
kfree(monc->monmap);
|
||||
}
|
||||
|
||||
static void handle_auth_reply(struct ceph_mon_client *monc,
|
||||
struct ceph_msg *msg)
|
||||
{
|
||||
int ret;
|
||||
|
||||
mutex_lock(&monc->mutex);
|
||||
monc->pending_auth = 0;
|
||||
ret = ceph_handle_auth_reply(monc->auth, msg->front.iov_base,
|
||||
msg->front.iov_len,
|
||||
monc->m_auth->front.iov_base,
|
||||
monc->m_auth->front_max);
|
||||
if (ret < 0) {
|
||||
monc->client->auth_err = ret;
|
||||
wake_up(&monc->client->auth_wq);
|
||||
} else if (ret > 0) {
|
||||
__send_prepared_auth_request(monc, ret);
|
||||
} else if (monc->auth->ops->is_authenticated(monc->auth)) {
|
||||
dout("authenticated, starting session\n");
|
||||
|
||||
monc->client->msgr->inst.name.type = CEPH_ENTITY_TYPE_CLIENT;
|
||||
monc->client->msgr->inst.name.num = monc->auth->global_id;
|
||||
|
||||
__send_subscribe(monc);
|
||||
__resend_statfs(monc);
|
||||
}
|
||||
mutex_unlock(&monc->mutex);
|
||||
}
|
||||
|
||||
static int __validate_auth(struct ceph_mon_client *monc)
|
||||
{
|
||||
int ret;
|
||||
|
||||
if (monc->pending_auth)
|
||||
return 0;
|
||||
|
||||
ret = ceph_build_auth(monc->auth, monc->m_auth->front.iov_base,
|
||||
monc->m_auth->front_max);
|
||||
if (ret <= 0)
|
||||
return ret; /* either an error, or no need to authenticate */
|
||||
__send_prepared_auth_request(monc, ret);
|
||||
return 0;
|
||||
}
|
||||
|
||||
int ceph_monc_validate_auth(struct ceph_mon_client *monc)
|
||||
{
|
||||
int ret;
|
||||
|
||||
mutex_lock(&monc->mutex);
|
||||
ret = __validate_auth(monc);
|
||||
mutex_unlock(&monc->mutex);
|
||||
return ret;
|
||||
}
|
||||
|
||||
/*
|
||||
* handle incoming message
|
||||
*/
|
||||
static void dispatch(struct ceph_connection *con, struct ceph_msg *msg)
|
||||
{
|
||||
struct ceph_mon_client *monc = con->private;
|
||||
int type = le16_to_cpu(msg->hdr.type);
|
||||
|
||||
if (!monc)
|
||||
return;
|
||||
|
||||
switch (type) {
|
||||
case CEPH_MSG_AUTH_REPLY:
|
||||
handle_auth_reply(monc, msg);
|
||||
break;
|
||||
|
||||
case CEPH_MSG_MON_SUBSCRIBE_ACK:
|
||||
handle_subscribe_ack(monc, msg);
|
||||
break;
|
||||
|
||||
case CEPH_MSG_STATFS_REPLY:
|
||||
handle_statfs_reply(monc, msg);
|
||||
break;
|
||||
|
||||
case CEPH_MSG_MON_MAP:
|
||||
ceph_monc_handle_map(monc, msg);
|
||||
break;
|
||||
|
||||
case CEPH_MSG_MDS_MAP:
|
||||
ceph_mdsc_handle_map(&monc->client->mdsc, msg);
|
||||
break;
|
||||
|
||||
case CEPH_MSG_OSD_MAP:
|
||||
ceph_osdc_handle_map(&monc->client->osdc, msg);
|
||||
break;
|
||||
|
||||
default:
|
||||
pr_err("received unknown message type %d %s\n", type,
|
||||
ceph_msg_type_name(type));
|
||||
}
|
||||
ceph_msg_put(msg);
|
||||
}
|
||||
|
||||
/*
|
||||
* Allocate memory for incoming message
|
||||
*/
|
||||
static struct ceph_msg *mon_alloc_msg(struct ceph_connection *con,
|
||||
struct ceph_msg_header *hdr,
|
||||
int *skip)
|
||||
{
|
||||
struct ceph_mon_client *monc = con->private;
|
||||
int type = le16_to_cpu(hdr->type);
|
||||
int front_len = le32_to_cpu(hdr->front_len);
|
||||
struct ceph_msg *m = NULL;
|
||||
|
||||
*skip = 0;
|
||||
|
||||
switch (type) {
|
||||
case CEPH_MSG_MON_SUBSCRIBE_ACK:
|
||||
m = ceph_msgpool_get(&monc->msgpool_subscribe_ack, front_len);
|
||||
break;
|
||||
case CEPH_MSG_STATFS_REPLY:
|
||||
m = ceph_msgpool_get(&monc->msgpool_statfs_reply, front_len);
|
||||
break;
|
||||
case CEPH_MSG_AUTH_REPLY:
|
||||
m = ceph_msgpool_get(&monc->msgpool_auth_reply, front_len);
|
||||
break;
|
||||
case CEPH_MSG_MON_MAP:
|
||||
case CEPH_MSG_MDS_MAP:
|
||||
case CEPH_MSG_OSD_MAP:
|
||||
m = ceph_msg_new(type, front_len, 0, 0, NULL);
|
||||
break;
|
||||
}
|
||||
|
||||
if (!m) {
|
||||
pr_info("alloc_msg unknown type %d\n", type);
|
||||
*skip = 1;
|
||||
}
|
||||
return m;
|
||||
}
|
||||
|
||||
/*
|
||||
* If the monitor connection resets, pick a new monitor and resubmit
|
||||
* any pending requests.
|
||||
*/
|
||||
static void mon_fault(struct ceph_connection *con)
|
||||
{
|
||||
struct ceph_mon_client *monc = con->private;
|
||||
|
||||
if (!monc)
|
||||
return;
|
||||
|
||||
dout("mon_fault\n");
|
||||
mutex_lock(&monc->mutex);
|
||||
if (!con->private)
|
||||
goto out;
|
||||
|
||||
if (monc->con && !monc->hunting)
|
||||
pr_info("mon%d %s session lost, "
|
||||
"hunting for new mon\n", monc->cur_mon,
|
||||
pr_addr(&monc->con->peer_addr.in_addr));
|
||||
|
||||
__close_session(monc);
|
||||
if (!monc->hunting) {
|
||||
/* start hunting */
|
||||
monc->hunting = true;
|
||||
__open_session(monc);
|
||||
} else {
|
||||
/* already hunting, let's wait a bit */
|
||||
__schedule_delayed(monc);
|
||||
}
|
||||
out:
|
||||
mutex_unlock(&monc->mutex);
|
||||
}
|
||||
|
||||
static const struct ceph_connection_operations mon_con_ops = {
|
||||
.get = ceph_con_get,
|
||||
.put = ceph_con_put,
|
||||
.dispatch = dispatch,
|
||||
.fault = mon_fault,
|
||||
.alloc_msg = mon_alloc_msg,
|
||||
};
|
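The hunting logic in mon_fault() above reduces to a two-state decision. The following is a minimal userspace sketch of that branch only, not the kernel code: struct mon_state, mon_state_fault() and mon_state_session_up() are hypothetical names, locking is omitted, and the point at which the real client clears the hunting flag lies outside this excerpt, so mon_state_session_up() is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What the caller should do after a connection fault (hypothetical names). */
enum fault_action {
	FAULT_OPEN_SESSION,	/* start hunting: try another monitor now */
	FAULT_RETRY_LATER,	/* already hunting: back off and retry */
};

/* Hypothetical reduction of the monitor client to its hunting flag. */
struct mon_state {
	bool hunting;
};

/* Mirrors the branch in mon_fault(): the first fault starts a hunt. */
static enum fault_action mon_state_fault(struct mon_state *s)
{
	assert(s != NULL);
	if (!s->hunting) {
		s->hunting = true;
		return FAULT_OPEN_SESSION;
	}
	return FAULT_RETRY_LATER;
}

/* Assumed hook: called once a monitor session is established again. */
static void mon_state_session_up(struct mon_state *s)
{
	assert(s != NULL);
	s->hunting = false;
}
```

A second fault while still hunting yields FAULT_RETRY_LATER, which corresponds to the __schedule_delayed() call in the original.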
119
fs/ceph/mon_client.h
Normal file
@ -0,0 +1,119 @@
|
||||
#ifndef _FS_CEPH_MON_CLIENT_H
|
||||
#define _FS_CEPH_MON_CLIENT_H
|
||||
|
||||
#include <linux/completion.h>
|
||||
#include <linux/rbtree.h>
|
||||
|
||||
#include "messenger.h"
|
||||
#include "msgpool.h"
|
||||
|
||||
struct ceph_client;
|
||||
struct ceph_mount_args;
|
||||
struct ceph_auth_client;
|
||||
|
||||
/*
|
||||
* The monitor map enumerates the set of all monitors.
|
||||
*/
|
||||
struct ceph_monmap {
|
||||
struct ceph_fsid fsid;
|
||||
u32 epoch;
|
||||
u32 num_mon;
|
||||
struct ceph_entity_inst mon_inst[0];
|
||||
};
|
||||
|
||||
struct ceph_mon_client;
|
||||
struct ceph_mon_statfs_request;
|
||||
|
||||
|
||||
/*
|
||||
* Generic mechanism for resending monitor requests.
|
||||
*/
|
||||
typedef void (*ceph_monc_request_func_t)(struct ceph_mon_client *monc,
|
||||
int newmon);
|
||||
|
||||
/* a pending monitor request */
|
||||
struct ceph_mon_request {
|
||||
struct ceph_mon_client *monc;
|
||||
struct delayed_work delayed_work;
|
||||
unsigned long delay;
|
||||
ceph_monc_request_func_t do_request;
|
||||
};
|
||||
|
||||
/*
|
||||
* statfs() is done a bit differently because we need to get data back
|
||||
* to the caller
|
||||
*/
|
||||
struct ceph_mon_statfs_request {
|
||||
u64 tid;
|
||||
struct rb_node node;
|
||||
int result;
|
||||
struct ceph_statfs *buf;
|
||||
struct completion completion;
|
||||
unsigned long last_attempt, delay; /* jiffies */
|
||||
struct ceph_msg *request; /* original request */
|
||||
};
|
||||
|
||||
struct ceph_mon_client {
|
||||
struct ceph_client *client;
|
||||
struct ceph_monmap *monmap;
|
||||
|
||||
struct mutex mutex;
|
||||
struct delayed_work delayed_work;
|
||||
|
||||
struct ceph_auth_client *auth;
|
||||
struct ceph_msg *m_auth;
|
||||
int pending_auth;
|
||||
|
||||
bool hunting;
|
||||
int cur_mon; /* last monitor i contacted */
|
||||
unsigned long sub_sent, sub_renew_after;
|
||||
struct ceph_connection *con;
|
||||
bool have_fsid;
|
||||
|
||||
/* msg pools */
|
||||
struct ceph_msgpool msgpool_subscribe_ack;
|
||||
struct ceph_msgpool msgpool_statfs_reply;
|
||||
struct ceph_msgpool msgpool_auth_reply;
|
||||
|
||||
/* pending statfs requests */
|
||||
struct rb_root statfs_request_tree;
|
||||
int num_statfs_requests;
|
||||
u64 last_tid;
|
||||
|
||||
/* mds/osd map */
|
||||
int want_next_osdmap; /* 1 = want, 2 = want+asked */
|
||||
u32 have_osdmap, have_mdsmap;
|
||||
|
||||
#ifdef CONFIG_DEBUG_FS
|
||||
struct dentry *debugfs_file;
|
||||
#endif
|
||||
};
|
||||
|
||||
extern struct ceph_monmap *ceph_monmap_decode(void *p, void *end);
|
||||
extern int ceph_monmap_contains(struct ceph_monmap *m,
|
||||
struct ceph_entity_addr *addr);
|
||||
|
||||
extern int ceph_monc_init(struct ceph_mon_client *monc, struct ceph_client *cl);
|
||||
extern void ceph_monc_stop(struct ceph_mon_client *monc);
|
||||
|
||||
/*
|
||||
* The model here is to indicate that we need a new map of at least
|
||||
* epoch @want, and also call in when we receive a map. We will
|
||||
* periodically rerequest the map from the monitor cluster until we
|
||||
* get what we want.
|
||||
*/
|
||||
extern int ceph_monc_got_mdsmap(struct ceph_mon_client *monc, u32 have);
|
||||
extern int ceph_monc_got_osdmap(struct ceph_mon_client *monc, u32 have);
|
||||
|
||||
extern void ceph_monc_request_next_osdmap(struct ceph_mon_client *monc);
|
||||
|
||||
extern int ceph_monc_do_statfs(struct ceph_mon_client *monc,
|
||||
struct ceph_statfs *buf);
|
||||
|
||||
extern int ceph_monc_open_session(struct ceph_mon_client *monc);
|
||||
|
||||
extern int ceph_monc_validate_auth(struct ceph_mon_client *monc);
|
||||
|
||||
|
||||
|
||||
#endif
|
186
fs/ceph/msgpool.c
Normal file
@ -0,0 +1,186 @@
|
||||
#include "ceph_debug.h"
|
||||
|
||||
#include <linux/err.h>
|
||||
#include <linux/sched.h>
|
||||
#include <linux/types.h>
|
||||
#include <linux/vmalloc.h>
|
||||
|
||||
#include "msgpool.h"
|
||||
|
||||
/*
|
||||
* We use msg pools to preallocate memory for messages we expect to
|
||||
* receive over the wire, to avoid getting ourselves into OOM
|
||||
* conditions at unexpected times. We use a few different
|
||||
* strategies:
|
||||
*
|
||||
* - for request/response type interactions, we preallocate the
|
||||
* memory needed for the response when we generate the request.
|
||||
*
|
||||
* - for messages we can receive at any time from the MDS, we preallocate
|
||||
* a pool of messages we can re-use.
|
||||
*
|
||||
* - for writeback, we preallocate some number of messages to use for
|
||||
* requests and their replies, so that we always make forward
|
||||
* progress.
|
||||
*
|
||||
* The msgpool behaves like a mempool_t, but keeps preallocated
|
||||
* ceph_msgs strung together on a list_head instead of using a pointer
|
||||
* vector. This avoids vector reallocation when we adjust the number
|
||||
* of preallocated items (which happens frequently).
|
||||
*/
|
||||
|
||||
|
||||
/*
|
||||
* Allocate or release as necessary to meet our target pool size.
|
||||
*/
|
||||
static int __fill_msgpool(struct ceph_msgpool *pool)
|
||||
{
|
||||
struct ceph_msg *msg;
|
||||
|
||||
while (pool->num < pool->min) {
|
||||
dout("fill_msgpool %p %d/%d allocating\n", pool, pool->num,
|
||||
pool->min);
|
||||
spin_unlock(&pool->lock);
|
||||
msg = ceph_msg_new(0, pool->front_len, 0, 0, NULL);
|
||||
spin_lock(&pool->lock);
|
||||
if (IS_ERR(msg))
|
||||
return PTR_ERR(msg);
|
||||
msg->pool = pool;
|
||||
list_add(&msg->list_head, &pool->msgs);
|
||||
pool->num++;
|
||||
}
|
||||
while (pool->num > pool->min) {
|
||||
msg = list_first_entry(&pool->msgs, struct ceph_msg, list_head);
|
||||
dout("fill_msgpool %p %d/%d releasing %p\n", pool, pool->num,
|
||||
pool->min, msg);
|
||||
list_del_init(&msg->list_head);
|
||||
pool->num--;
|
||||
ceph_msg_kfree(msg);
|
||||
}
|
||||
return 0;
|
||||
}
|
||||
|
||||
int ceph_msgpool_init(struct ceph_msgpool *pool,
|
||||
int front_len, int min, bool blocking)
|
||||
{
|
||||
int ret;
|
||||
|
||||
dout("msgpool_init %p front_len %d min %d\n", pool, front_len, min);
|
||||
spin_lock_init(&pool->lock);
|
||||
pool->front_len = front_len;
|
||||
INIT_LIST_HEAD(&pool->msgs);
|
||||
pool->num = 0;
|
||||
pool->min = min;
|
||||
pool->blocking = blocking;
|
||||
init_waitqueue_head(&pool->wait);
|
||||
|
||||
spin_lock(&pool->lock);
|
||||
ret = __fill_msgpool(pool);
|
||||
spin_unlock(&pool->lock);
|
||||
return ret;
|
||||
}
|
||||
|
||||
void ceph_msgpool_destroy(struct ceph_msgpool *pool)
|
||||
{
|
||||
dout("msgpool_destroy %p\n", pool);
|
||||
spin_lock(&pool->lock);
|
||||
pool->min = 0;
|
||||
__fill_msgpool(pool);
|
||||
spin_unlock(&pool->lock);
|
||||
}
|
||||
|
||||
int ceph_msgpool_resv(struct ceph_msgpool *pool, int delta)
|
||||
{
|
||||
int ret;
|
||||
|
||||
spin_lock(&pool->lock);
|
||||
dout("msgpool_resv %p delta %d\n", pool, delta);
|
||||
pool->min += delta;
|
||||
ret = __fill_msgpool(pool);
|
||||
spin_unlock(&pool->lock);
|
||||
return ret;
|
||||
}
|
||||
|
||||
struct ceph_msg *ceph_msgpool_get(struct ceph_msgpool *pool, int front_len)
|
||||
{
|
||||
wait_queue_t wait;
|
||||
struct ceph_msg *msg;
|
||||
|
||||
if (front_len && front_len > pool->front_len) {
|
||||
pr_err("msgpool_get pool %p need front %d, pool size is %d\n",
|
||||
pool, front_len, pool->front_len);
|
||||
WARN_ON(1);
|
||||
|
||||
/* try to alloc a fresh message */
|
||||
msg = ceph_msg_new(0, front_len, 0, 0, NULL);
|
||||
if (!IS_ERR(msg))
|
||||
return msg;
|
||||
}
|
||||
|
||||
if (!front_len)
|
||||
front_len = pool->front_len;
|
||||
|
||||
if (pool->blocking) {
|
||||
/* mempool_t behavior; first try to alloc */
|
||||
msg = ceph_msg_new(0, front_len, 0, 0, NULL);
|
||||
if (!IS_ERR(msg))
|
||||
return msg;
|
||||
}
|
||||
|
||||
while (1) {
|
||||
spin_lock(&pool->lock);
|
||||
if (likely(pool->num)) {
|
||||
msg = list_entry(pool->msgs.next, struct ceph_msg,
|
||||
list_head);
|
||||
list_del_init(&msg->list_head);
|
||||
pool->num--;
|
||||
dout("msgpool_get %p got %p, now %d/%d\n", pool, msg,
|
||||
pool->num, pool->min);
|
||||
spin_unlock(&pool->lock);
|
||||
return msg;
|
||||
}
|
||||
pr_err("msgpool_get %p now %d/%d, %s\n", pool, pool->num,
|
||||
pool->min, pool->blocking ? "waiting" : "may fail");
|
||||
spin_unlock(&pool->lock);
|
||||
|
||||
if (!pool->blocking) {
|
||||
WARN_ON(1);
|
||||
|
||||
/* maybe we can allocate it now? */
|
||||
msg = ceph_msg_new(0, front_len, 0, 0, NULL);
|
||||
if (!IS_ERR(msg))
|
||||
return msg;
|
||||
|
||||
pr_err("msgpool_get %p empty + alloc failed\n", pool);
|
||||
return ERR_PTR(-ENOMEM);
|
||||
}
|
||||
|
||||
init_wait(&wait);
|
||||
prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
|
||||
schedule();
|
||||
finish_wait(&pool->wait, &wait);
|
||||
}
|
||||
}
|
||||
|
||||
void ceph_msgpool_put(struct ceph_msgpool *pool, struct ceph_msg *msg)
|
||||
{
|
||||
spin_lock(&pool->lock);
|
||||
if (pool->num < pool->min) {
|
||||
/* reset msg front_len; user may have changed it */
|
||||
msg->front.iov_len = pool->front_len;
|
||||
msg->hdr.front_len = cpu_to_le32(pool->front_len);
|
||||
|
||||
kref_set(&msg->kref, 1); /* retake a single ref */
|
||||
list_add(&msg->list_head, &pool->msgs);
|
||||
pool->num++;
|
||||
dout("msgpool_put %p reclaim %p, now %d/%d\n", pool, msg,
|
||||
pool->num, pool->min);
|
||||
spin_unlock(&pool->lock);
|
||||
wake_up(&pool->wait);
|
||||
} else {
|
||||
dout("msgpool_put %p drop %p, at %d/%d\n", pool, msg,
|
||||
pool->num, pool->min);
|
||||
spin_unlock(&pool->lock);
|
||||
ceph_msg_kfree(msg);
|
||||
}
|
||||
}
|
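The reserve policy of msgpool.c above (fill to a target, hand out reserved items, reclaim on put only while the reserve is short) can be sketched in userspace as follows. This is an illustration under stated assumptions, not the kernel implementation: struct pool, struct item and the pool_*() functions are hypothetical names, malloc() stands in for ceph_msg_new(), and the spinlock and the blocking wait path are left out. It follows the non-blocking order, where the reserve is tried first.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for a preallocated message. */
struct item {
	struct item *next;
};

/* Hypothetical pool: 'num' items are held in reserve, 'min' is the target. */
struct pool {
	struct item *head;
	int num, min;
};

/* Allocate or free reserve items until num == min (cf. __fill_msgpool). */
static int pool_fill(struct pool *p)
{
	while (p->num < p->min) {
		struct item *it = malloc(sizeof(*it));
		if (!it)
			return -1;
		it->next = p->head;
		p->head = it;
		p->num++;
	}
	while (p->num > p->min) {
		struct item *it = p->head;
		p->head = it->next;
		p->num--;
		free(it);
	}
	return 0;
}

static int pool_init(struct pool *p, int min)
{
	assert(min >= 0);
	p->head = NULL;
	p->num = 0;
	p->min = min;
	return pool_fill(p);
}

/* Adjust the reserve target by delta (cf. ceph_msgpool_resv). */
static int pool_resv(struct pool *p, int delta)
{
	p->min += delta;
	return pool_fill(p);
}

/* Take a reserved item if one exists; otherwise fall back to malloc(). */
static struct item *pool_get(struct pool *p)
{
	struct item *it = p->head;
	if (it) {
		p->head = it->next;
		p->num--;
		return it;
	}
	return malloc(sizeof(*it));
}

/* Return an item: refill the reserve if it is short, else free it. */
static void pool_put(struct pool *p, struct item *it)
{
	if (p->num < p->min) {
		it->next = p->head;
		p->head = it;
		p->num++;
	} else {
		free(it);
	}
}
```

Keeping the reserve on a linked list rather than in an array means that changing the target with pool_resv() never has to reallocate a vector, which is the design point the original comment makes.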
27
fs/ceph/msgpool.h
Normal file
@ -0,0 +1,27 @@
|
||||
#ifndef _FS_CEPH_MSGPOOL
|
||||
#define _FS_CEPH_MSGPOOL
|
||||
|
||||
#include "messenger.h"
|
||||
|
||||
/*
|
||||
* we use memory pools for preallocating messages we may receive, to
|
||||
* avoid unexpected OOM conditions.
|
||||
*/
|
||||
struct ceph_msgpool {
|
||||
spinlock_t lock;
|
||||
int front_len; /* preallocated payload size */
|
||||
struct list_head msgs; /* msgs in the pool; each has 1 ref */
|
||||
int num, min; /* cur, min # msgs in the pool */
|
||||
bool blocking;
|
||||
wait_queue_head_t wait;
|
||||
};
|
||||
|
||||
extern int ceph_msgpool_init(struct ceph_msgpool *pool,
|
||||
int front_len, int min, bool blocking);
|
||||
extern void ceph_msgpool_destroy(struct ceph_msgpool *pool);
|
||||
extern int ceph_msgpool_resv(struct ceph_msgpool *, int delta);
|
||||
extern struct ceph_msg *ceph_msgpool_get(struct ceph_msgpool *,
|
||||
int front_len);
|
||||
extern void ceph_msgpool_put(struct ceph_msgpool *, struct ceph_msg *);
|
||||
|
||||
#endif
|
158
fs/ceph/msgr.h
Normal file
@ -0,0 +1,158 @@
|
||||
#ifndef __MSGR_H
|
||||
#define __MSGR_H
|
||||
|
||||
/*
|
||||
* Data types for message passing layer used by Ceph.
|
||||
*/
|
||||
|
||||
#define CEPH_MON_PORT 6789 /* default monitor port */
|
||||
|
||||
/*
|
||||
* client-side processes will try to bind to ports in this
|
||||
* range, simply for the benefit of tools like nmap or wireshark
|
||||
* that would like to identify the protocol.
|
||||
*/
|
||||
#define CEPH_PORT_FIRST 6789
|
||||
#define CEPH_PORT_START 6800 /* non-monitors start here */
|
||||
#define CEPH_PORT_LAST 6900
|
||||
|
||||
/*
|
||||
* tcp connection banner. include a protocol version, and adjust it
|
||||
* whenever the wire protocol changes. try to keep this string length
|
||||
* constant.
|
||||
*/
|
||||
#define CEPH_BANNER "ceph v027"
|
||||
#define CEPH_BANNER_MAX_LEN 30
|
||||
|
||||
|
||||
/*
|
||||
* Rollover-safe type and comparator for 32-bit sequence numbers.
|
||||
* Comparator returns a negative, zero, or positive value.
|
||||
*/
|
||||
typedef __u32 ceph_seq_t;
|
||||
|
||||
static inline __s32 ceph_seq_cmp(__u32 a, __u32 b)
|
||||
{
|
||||
return (__s32)a - (__s32)b;
|
||||
}
|
||||
|
||||
|
||||
/*
|
||||
* entity_name -- logical name for a process participating in the
|
||||
* network, e.g. 'mds0' or 'osd3'.
|
||||
*/
|
||||
struct ceph_entity_name {
|
||||
__u8 type; /* CEPH_ENTITY_TYPE_* */
|
||||
__le64 num;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
#define CEPH_ENTITY_TYPE_MON 0x01
|
||||
#define CEPH_ENTITY_TYPE_MDS 0x02
|
||||
#define CEPH_ENTITY_TYPE_OSD 0x04
|
||||
#define CEPH_ENTITY_TYPE_CLIENT 0x08
|
||||
#define CEPH_ENTITY_TYPE_ADMIN 0x10
|
||||
#define CEPH_ENTITY_TYPE_AUTH 0x20
|
||||
|
||||
#define CEPH_ENTITY_TYPE_ANY 0xFF
|
||||
|
||||
extern const char *ceph_entity_type_name(int type);
|
||||
|
||||
/*
|
||||
* entity_addr -- network address
|
||||
*/
|
||||
struct ceph_entity_addr {
|
||||
__le32 type;
|
||||
__le32 nonce; /* unique id for process (e.g. pid) */
|
||||
struct sockaddr_storage in_addr;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
struct ceph_entity_inst {
|
||||
struct ceph_entity_name name;
|
||||
struct ceph_entity_addr addr;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
|
||||
/* used by message exchange protocol */
|
||||
#define CEPH_MSGR_TAG_READY 1 /* server->client: ready for messages */
|
||||
#define CEPH_MSGR_TAG_RESETSESSION 2 /* server->client: reset, try again */
|
||||
#define CEPH_MSGR_TAG_WAIT 3 /* server->client: wait for racing
|
||||
incoming connection */
|
||||
#define CEPH_MSGR_TAG_RETRY_SESSION 4 /* server->client + cseq: try again
|
||||
with higher cseq */
|
||||
#define CEPH_MSGR_TAG_RETRY_GLOBAL 5 /* server->client + gseq: try again
|
||||
with higher gseq */
|
||||
#define CEPH_MSGR_TAG_CLOSE 6 /* closing pipe */
|
||||
#define CEPH_MSGR_TAG_MSG 7 /* message */
|
||||
#define CEPH_MSGR_TAG_ACK 8 /* message ack */
|
||||
#define CEPH_MSGR_TAG_KEEPALIVE 9 /* just a keepalive byte! */
|
||||
#define CEPH_MSGR_TAG_BADPROTOVER 10 /* bad protocol version */
|
||||
#define CEPH_MSGR_TAG_BADAUTHORIZER 11 /* bad authorizer */
|
||||
#define CEPH_MSGR_TAG_FEATURES 12 /* insufficient features */
|
||||
|
||||
|
||||
/*
|
||||
* connection negotiation
|
||||
*/
|
||||
struct ceph_msg_connect {
|
||||
__le64 features; /* supported feature bits */
|
||||
__le32 host_type; /* CEPH_ENTITY_TYPE_* */
|
||||
__le32 global_seq; /* count connections initiated by this host */
|
||||
__le32 connect_seq; /* count connections initiated in this session */
|
||||
__le32 protocol_version;
|
||||
__le32 authorizer_protocol;
|
||||
__le32 authorizer_len;
|
||||
__u8 flags; /* CEPH_MSG_CONNECT_* */
|
||||
} __attribute__ ((packed));
|
||||
|
||||
struct ceph_msg_connect_reply {
|
||||
__u8 tag;
|
||||
__le64 features; /* feature bits for this session */
|
||||
__le32 global_seq;
|
||||
__le32 connect_seq;
|
||||
__le32 protocol_version;
|
||||
__le32 authorizer_len;
|
||||
__u8 flags;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
#define CEPH_MSG_CONNECT_LOSSY 1 /* messages i send may be safely dropped */
|
||||
|
||||
|
||||
/*
|
||||
* message header
|
||||
*/
|
||||
struct ceph_msg_header {
|
||||
__le64 seq; /* message seq# for this session */
|
||||
__le64 tid; /* transaction id */
|
||||
__le16 type; /* message type */
|
||||
__le16 priority; /* priority. higher value == higher priority */
|
||||
__le16 version; /* version of message encoding */
|
||||
|
||||
__le32 front_len; /* bytes in main payload */
|
||||
__le32 middle_len;/* bytes in middle payload */
|
||||
__le32 data_len; /* bytes of data payload */
|
||||
__le16 data_off; /* sender: include full offset;
|
||||
receiver: mask against ~PAGE_MASK */
|
||||
|
||||
struct ceph_entity_inst src, orig_src;
|
||||
__le32 reserved;
|
||||
__le32 crc; /* header crc32c */
|
||||
} __attribute__ ((packed));
|
||||
|
||||
#define CEPH_MSG_PRIO_LOW 64
|
||||
#define CEPH_MSG_PRIO_DEFAULT 127
|
||||
#define CEPH_MSG_PRIO_HIGH 196
|
||||
#define CEPH_MSG_PRIO_HIGHEST 255
|
||||
|
||||
/*
|
||||
* follows data payload
|
||||
*/
|
||||
struct ceph_msg_footer {
|
||||
__le32 front_crc, middle_crc, data_crc;
|
||||
__u8 flags;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
#define CEPH_MSG_FOOTER_COMPLETE (1<<0) /* msg wasn't aborted */
|
||||
#define CEPH_MSG_FOOTER_NOCRC (1<<1) /* no data crc */
|
||||
|
||||
|
||||
#endif
|
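The rollover-safe sequence comparison declared in msgr.h above relies on the difference of two 32-bit counters being interpreted as a signed value. A small userspace sketch of the idea follows. The names seq_cmp() and seq_after() are hypothetical, and unlike the header this sketch subtracts in unsigned arithmetic before casting, an assumption made here to stay clear of signed overflow in portable C.

```c
#include <stdint.h>

/*
 * Compare two 32-bit sequence numbers that may have wrapped.
 * The result is negative, zero, or positive; it is only meaningful
 * while the two values are less than 2^31 apart.
 */
static int32_t seq_cmp(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b);
}

/* True if sequence number a is logically later than b. */
static int seq_after(uint32_t a, uint32_t b)
{
	return seq_cmp(a, b) > 0;
}
```

With this, a counter that has wrapped to 2 still compares as later than 0xFFFFFFFE, which a plain `a > b` test would get wrong.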
1537
fs/ceph/osd_client.c
Normal file
File diff suppressed because it is too large
166
fs/ceph/osd_client.h
Normal file
@ -0,0 +1,166 @@
|
||||
#ifndef _FS_CEPH_OSD_CLIENT_H
|
||||
#define _FS_CEPH_OSD_CLIENT_H
|
||||
|
||||
#include <linux/completion.h>
|
||||
#include <linux/kref.h>
|
||||
#include <linux/mempool.h>
|
||||
#include <linux/rbtree.h>
|
||||
|
||||
#include "types.h"
|
||||
#include "osdmap.h"
|
||||
#include "messenger.h"
|
||||
|
||||
struct ceph_msg;
|
||||
struct ceph_snap_context;
|
||||
struct ceph_osd_request;
|
||||
struct ceph_osd_client;
|
||||
struct ceph_authorizer;
|
||||
|
||||
/*
|
||||
* completion callback for async writepages
|
||||
*/
|
||||
typedef void (*ceph_osdc_callback_t)(struct ceph_osd_request *,
|
||||
struct ceph_msg *);
|
||||
|
||||
/* a given osd we're communicating with */
|
||||
struct ceph_osd {
|
||||
atomic_t o_ref;
|
||||
struct ceph_osd_client *o_osdc;
|
||||
int o_osd;
|
||||
int o_incarnation;
|
||||
struct rb_node o_node;
|
||||
struct ceph_connection o_con;
|
||||
struct list_head o_requests;
|
||||
struct list_head o_osd_lru;
|
||||
struct ceph_authorizer *o_authorizer;
|
||||
void *o_authorizer_buf, *o_authorizer_reply_buf;
|
||||
size_t o_authorizer_buf_len, o_authorizer_reply_buf_len;
|
||||
unsigned long lru_ttl;
|
||||
int o_marked_for_keepalive;
|
||||
struct list_head o_keepalive_item;
|
||||
};
|
||||
|
||||
/* an in-flight request */
|
||||
struct ceph_osd_request {
|
||||
u64 r_tid; /* unique for this client */
|
||||
struct rb_node r_node;
|
||||
struct list_head r_req_lru_item;
|
||||
struct list_head r_osd_item;
|
||||
struct ceph_osd *r_osd;
|
||||
struct ceph_pg r_pgid;
|
||||
|
||||
struct ceph_connection *r_con_filling_msg;
|
||||
|
||||
struct ceph_msg *r_request, *r_reply;
|
||||
int r_result;
|
||||
int r_flags; /* any additional flags for the osd */
|
||||
u32 r_sent; /* >0 if r_request is sending/sent */
|
||||
int r_got_reply;
|
||||
|
||||
struct ceph_osd_client *r_osdc;
|
||||
struct kref r_kref;
|
||||
bool r_mempool;
|
||||
struct completion r_completion, r_safe_completion;
|
||||
ceph_osdc_callback_t r_callback, r_safe_callback;
|
||||
struct ceph_eversion r_reassert_version;
|
||||
struct list_head r_unsafe_item;
|
||||
|
||||
struct inode *r_inode; /* for use by callbacks */
|
||||
struct writeback_control *r_wbc; /* ditto */
|
||||
|
||||
char r_oid[40]; /* object name */
|
||||
int r_oid_len;
|
||||
unsigned long r_sent_stamp;
|
||||
bool r_resend; /* msg send failed, needs retry */
|
||||
|
||||
struct ceph_file_layout r_file_layout;
|
||||
struct ceph_snap_context *r_snapc; /* snap context for writes */
|
||||
unsigned r_num_pages; /* size of page array (follows) */
|
||||
struct page **r_pages; /* pages for data payload */
|
||||
int r_pages_from_pool;
|
||||
int r_own_pages; /* if true, i own page list */
|
||||
};
|
||||
|
||||
struct ceph_osd_client {
|
||||
struct ceph_client *client;
|
||||
|
||||
struct ceph_osdmap *osdmap; /* current map */
|
||||
struct rw_semaphore map_sem;
|
||||
struct completion map_waiters;
|
||||
u64 last_requested_map;
|
||||
|
||||
struct mutex request_mutex;
|
||||
struct rb_root osds; /* osds */
|
||||
struct list_head osd_lru; /* idle osds */
|
||||
u64 timeout_tid; /* tid of timeout triggering rq */
|
||||
u64 last_tid; /* tid of last request */
|
||||
struct rb_root requests; /* pending requests */
|
||||
struct list_head req_lru; /* pending requests lru */
|
||||
int num_requests;
|
||||
struct delayed_work timeout_work;
|
||||
struct delayed_work osds_timeout_work;
|
||||
#ifdef CONFIG_DEBUG_FS
|
||||
struct dentry *debugfs_file;
|
||||
#endif
|
||||
|
||||
mempool_t *req_mempool;
|
||||
|
||||
struct ceph_msgpool msgpool_op;
|
||||
struct ceph_msgpool msgpool_op_reply;
|
||||
};
|
||||
|
||||
extern int ceph_osdc_init(struct ceph_osd_client *osdc,
|
||||
struct ceph_client *client);
|
||||
extern void ceph_osdc_stop(struct ceph_osd_client *osdc);
|
||||
|
||||
extern void ceph_osdc_handle_reply(struct ceph_osd_client *osdc,
|
||||
struct ceph_msg *msg);
|
||||
extern void ceph_osdc_handle_map(struct ceph_osd_client *osdc,
|
||||
struct ceph_msg *msg);
|
||||
|
||||
extern struct ceph_osd_request *ceph_osdc_new_request(struct ceph_osd_client *,
|
||||
struct ceph_file_layout *layout,
|
||||
struct ceph_vino vino,
|
||||
u64 offset, u64 *len, int op, int flags,
|
||||
struct ceph_snap_context *snapc,
|
||||
int do_sync, u32 truncate_seq,
|
||||
u64 truncate_size,
|
||||
struct timespec *mtime,
|
||||
bool use_mempool, int num_reply);
|
||||
|
||||
static inline void ceph_osdc_get_request(struct ceph_osd_request *req)
|
||||
{
|
||||
kref_get(&req->r_kref);
|
||||
}
|
||||
extern void ceph_osdc_release_request(struct kref *kref);
|
||||
static inline void ceph_osdc_put_request(struct ceph_osd_request *req)
|
||||
{
|
||||
kref_put(&req->r_kref, ceph_osdc_release_request);
|
||||
}
|
||||
|
||||
extern int ceph_osdc_start_request(struct ceph_osd_client *osdc,
|
||||
struct ceph_osd_request *req,
|
||||
bool nofail);
|
||||
extern int ceph_osdc_wait_request(struct ceph_osd_client *osdc,
|
||||
struct ceph_osd_request *req);
|
||||
extern void ceph_osdc_sync(struct ceph_osd_client *osdc);
|
||||
|
||||
extern int ceph_osdc_readpages(struct ceph_osd_client *osdc,
|
||||
struct ceph_vino vino,
|
||||
struct ceph_file_layout *layout,
|
||||
u64 off, u64 *plen,
|
||||
u32 truncate_seq, u64 truncate_size,
|
||||
struct page **pages, int nr_pages);
|
||||
|
||||
extern int ceph_osdc_writepages(struct ceph_osd_client *osdc,
|
||||
struct ceph_vino vino,
|
||||
struct ceph_file_layout *layout,
|
||||
struct ceph_snap_context *sc,
|
||||
u64 off, u64 len,
|
||||
u32 truncate_seq, u64 truncate_size,
|
||||
struct timespec *mtime,
|
||||
struct page **pages, int nr_pages,
|
||||
int flags, int do_sync, bool nofail);
|
||||
|
||||
#endif
|
||||
|
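The ceph_osdc_get_request() and ceph_osdc_put_request() helpers in osd_client.h above follow the usual kref pattern: an embedded reference count, with a release callback that runs when the last reference is dropped. The following userspace sketch shows the shape of that pattern. struct ref, struct request and the functions are hypothetical names, and the counter here is a plain int rather than an atomic, so this is not safe for concurrent use.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, non-atomic stand-in for struct kref. */
struct ref {
	int count;
};

static void ref_init(struct ref *r)
{
	r->count = 1;
}

static void ref_get(struct ref *r)
{
	assert(r->count > 0);
	r->count++;
}

/* Drop one reference; call release() when the count reaches zero. */
static int ref_put(struct ref *r, void (*release)(struct ref *))
{
	assert(r->count > 0);
	if (--r->count == 0) {
		release(r);
		return 1;
	}
	return 0;
}

/* Hypothetical object that embeds the reference count. */
struct request {
	int id;
	int released;	/* set by the release callback, for illustration */
	struct ref ref;
};

/* Recover the containing request from its embedded ref. */
#define request_of(r) \
	((struct request *)((char *)(r) - offsetof(struct request, ref)))

static void request_release(struct ref *r)
{
	request_of(r)->released = 1;
}

static void request_get(struct request *req)
{
	ref_get(&req->ref);
}

static void request_put(struct request *req)
{
	ref_put(&req->ref, request_release);
}
```

A real release callback would free the object; the sketch only sets a flag so that the effect can be observed afterwards.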
1019
fs/ceph/osdmap.c
Normal file
File diff suppressed because it is too large
125
fs/ceph/osdmap.h
Normal file
@ -0,0 +1,125 @@
|
||||
#ifndef _FS_CEPH_OSDMAP_H
|
||||
#define _FS_CEPH_OSDMAP_H
|
||||
|
||||
#include <linux/rbtree.h>
|
||||
#include "types.h"
|
||||
#include "ceph_fs.h"
|
||||
#include "crush/crush.h"
|
||||
|
||||
/*
|
||||
* The osd map describes the current membership of the osd cluster and
|
||||
* specifies the mapping of objects to placement groups and placement
|
||||
* groups to (sets of) osds. That is, it completely specifies the
|
||||
* (desired) distribution of all data objects in the system at some
|
||||
* point in time.
|
||||
*
|
||||
* Each map version is identified by an epoch, which increases monotonically.
|
||||
*
|
||||
* The map can be updated either via an incremental map (diff) describing
|
||||
* the change between two successive epochs, or as a fully encoded map.
|
||||
*/
|
||||
struct ceph_pg_pool_info {
|
||||
struct rb_node node;
|
||||
int id;
|
||||
struct ceph_pg_pool v;
|
||||
int pg_num_mask, pgp_num_mask, lpg_num_mask, lpgp_num_mask;
|
||||
};
|
||||
|
||||
struct ceph_pg_mapping {
|
||||
struct rb_node node;
|
||||
struct ceph_pg pgid;
|
||||
int len;
|
||||
int osds[];
|
||||
};
|
||||
|
||||
struct ceph_osdmap {
|
||||
struct ceph_fsid fsid;
|
||||
u32 epoch;
|
||||
u32 mkfs_epoch;
|
||||
struct ceph_timespec created, modified;
|
||||
|
||||
u32 flags; /* CEPH_OSDMAP_* */
|
||||
|
||||
u32 max_osd; /* size of osd_state, _weight, _addr arrays */
|
||||
u8 *osd_state; /* CEPH_OSD_* */
|
||||
u32 *osd_weight; /* 0 = failed, 0x10000 = 100% normal */
|
||||
struct ceph_entity_addr *osd_addr;
|
||||
|
||||
struct rb_root pg_temp;
|
||||
struct rb_root pg_pools;
|
||||
u32 pool_max;
|
||||
|
||||
/* the CRUSH map specifies the mapping of placement groups to
|
||||
* the list of osds that store+replicate them. */
|
||||
struct crush_map *crush;
|
||||
};
|
||||
|
||||
/*
|
||||
* file layout helpers
|
||||
*/
|
||||
#define ceph_file_layout_su(l) ((__s32)le32_to_cpu((l).fl_stripe_unit))
|
||||
#define ceph_file_layout_stripe_count(l) \
|
||||
((__s32)le32_to_cpu((l).fl_stripe_count))
|
||||
#define ceph_file_layout_object_size(l) ((__s32)le32_to_cpu((l).fl_object_size))
|
||||
#define ceph_file_layout_cas_hash(l) ((__s32)le32_to_cpu((l).fl_cas_hash))
|
||||
#define ceph_file_layout_object_su(l) \
|
||||
((__s32)le32_to_cpu((l).fl_object_stripe_unit))
|
||||
#define ceph_file_layout_pg_preferred(l) \
|
||||
((__s32)le32_to_cpu((l).fl_pg_preferred))
|
||||
#define ceph_file_layout_pg_pool(l) \
|
||||
((__s32)le32_to_cpu((l).fl_pg_pool))
|
||||
|
||||
static inline unsigned ceph_file_layout_stripe_width(struct ceph_file_layout *l)
|
||||
{
|
||||
return le32_to_cpu(l->fl_stripe_unit) *
|
||||
le32_to_cpu(l->fl_stripe_count);
|
||||
}
|
||||
|
||||
/* "period" == bytes before i start on a new set of objects */
|
||||
static inline unsigned ceph_file_layout_period(struct ceph_file_layout *l)
|
||||
{
|
||||
return le32_to_cpu(l->fl_object_size) *
|
||||
le32_to_cpu(l->fl_stripe_count);
|
||||
}
|
||||
|
||||
|
||||
static inline int ceph_osd_is_up(struct ceph_osdmap *map, int osd)
|
||||
{
|
||||
return (osd < map->max_osd) && (map->osd_state[osd] & CEPH_OSD_UP);
|
||||
}
|
||||
|
||||
static inline bool ceph_osdmap_flag(struct ceph_osdmap *map, int flag)
|
||||
{
|
||||
return map && (map->flags & flag);
|
||||
}
|
||||
|
||||
extern char *ceph_osdmap_state_str(char *str, int len, int state);
|
||||
|
||||
static inline struct ceph_entity_addr *ceph_osd_addr(struct ceph_osdmap *map,
|
||||
int osd)
|
||||
{
|
||||
if (osd >= map->max_osd)
|
||||
return NULL;
|
||||
return &map->osd_addr[osd];
|
||||
}
|
||||
|
||||
extern struct ceph_osdmap *osdmap_decode(void **p, void *end);
|
||||
extern struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
|
||||
struct ceph_osdmap *map,
|
||||
struct ceph_messenger *msgr);
|
||||
extern void ceph_osdmap_destroy(struct ceph_osdmap *map);
|
||||
|
||||
/* calculate mapping of a file extent to an object */
|
||||
extern void ceph_calc_file_object_mapping(struct ceph_file_layout *layout,
|
||||
u64 off, u64 *plen,
|
||||
u64 *bno, u64 *oxoff, u64 *oxlen);
|
||||
|
||||
/* calculate mapping of object to a placement group */
|
||||
extern int ceph_calc_object_layout(struct ceph_object_layout *ol,
|
||||
const char *oid,
|
||||
struct ceph_file_layout *fl,
|
||||
struct ceph_osdmap *osdmap);
|
||||
extern int ceph_calc_pg_primary(struct ceph_osdmap *osdmap,
|
||||
struct ceph_pg pgid);
|
||||
|
||||
#endif
|
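The stripe width and period helpers in osdmap.h above are simple products of the layout fields. A worked userspace sketch follows. struct layout and the layout_*() functions are hypothetical names, the fields are kept in host byte order (so the le32_to_cpu() conversions are dropped), and layout_object_set() assumes that object_size is a multiple of stripe_unit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-endian subset of a file layout. */
struct layout {
	uint32_t stripe_unit;	/* bytes written to one object before moving on */
	uint32_t stripe_count;	/* objects striped across in one object set */
	uint32_t object_size;	/* bytes per object */
};

/* Bytes in one full stripe across all objects of a set. */
static uint32_t layout_stripe_width(const struct layout *l)
{
	return l->stripe_unit * l->stripe_count;
}

/* "period": bytes of file data before a new set of objects is started. */
static uint32_t layout_period(const struct layout *l)
{
	return l->object_size * l->stripe_count;
}

/* Index of the object set that holds file offset 'off'. */
static uint64_t layout_object_set(const struct layout *l, uint64_t off)
{
	uint32_t period = layout_period(l);

	assert(period != 0);
	return off / period;
}
```

For a 64 KB stripe unit, a stripe count of 4 and 4 MB objects, the stripe width is 256 KB and the period is 16 MB, so file offset 16 MB is the first byte of the second object set.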
54
fs/ceph/pagelist.c
Normal file
@ -0,0 +1,54 @@
|
||||
|
||||
#include <linux/pagemap.h>
|
||||
#include <linux/highmem.h>
|
||||
|
||||
#include "pagelist.h"
|
||||
|
||||
int ceph_pagelist_release(struct ceph_pagelist *pl)
|
||||
{
|
||||
if (pl->mapped_tail)
|
||||
kunmap(pl->mapped_tail);
|
||||
while (!list_empty(&pl->head)) {
|
||||
struct page *page = list_first_entry(&pl->head, struct page,
|
||||
lru);
|
||||
list_del(&page->lru);
|
||||
__free_page(page);
|
||||
}
|
||||
return 0;
|
||||
}
|
||||
|
||||
static int ceph_pagelist_addpage(struct ceph_pagelist *pl)
|
||||
{
|
||||
struct page *page = alloc_page(GFP_NOFS);
|
||||
if (!page)
|
||||
return -ENOMEM;
|
||||
pl->room += PAGE_SIZE;
|
||||
list_add_tail(&page->lru, &pl->head);
|
||||
if (pl->mapped_tail)
|
||||
kunmap(pl->mapped_tail);
|
||||
pl->mapped_tail = kmap(page);
|
||||
return 0;
|
||||
}
|
||||
|
||||
int ceph_pagelist_append(struct ceph_pagelist *pl, void *buf, size_t len)
|
||||
{
|
||||
while (pl->room < len) {
|
||||
size_t bit = pl->room;
|
||||
int ret;
|
||||
|
||||
memcpy(pl->mapped_tail + (pl->length & ~PAGE_CACHE_MASK),
|
||||
buf, bit);
|
||||
pl->length += bit;
|
||||
pl->room -= bit;
|
||||
buf += bit;
|
||||
len -= bit;
|
||||
ret = ceph_pagelist_addpage(pl);
|
||||
if (ret)
|
||||
return ret;
|
||||
}
|
||||
|
||||
memcpy(pl->mapped_tail + (pl->length & ~PAGE_CACHE_MASK), buf, len);
|
||||
pl->length += len;
|
||||
pl->room -= len;
|
||||
return 0;
|
||||
}
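The append loop above fills whatever room is left in the tail page, adds a page, and repeats until the remainder fits. The following is a userspace sketch of the same pattern, assuming a tiny fixed chunk size in place of PAGE_SIZE and malloc'd chunks in place of kmap'd pages. All names (`chunklist`, `CHUNK_SIZE`, and so on) are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE 8	/* stand-in for PAGE_SIZE, tiny for illustration */

struct chunk {
	struct chunk *next;
	unsigned char data[CHUNK_SIZE];
};

struct chunklist {
	struct chunk *head, *tail;
	size_t length;	/* bytes stored so far */
	size_t room;	/* free bytes left in the tail chunk */
};

static void chunklist_init(struct chunklist *cl)
{
	cl->head = cl->tail = NULL;
	cl->length = 0;
	cl->room = 0;
}

/* Allocate a new tail chunk; returns 0 on success, -1 on failure. */
static int chunklist_addchunk(struct chunklist *cl)
{
	struct chunk *c = calloc(1, sizeof(*c));

	if (!c)
		return -1;
	if (cl->tail)
		cl->tail->next = c;
	else
		cl->head = c;
	cl->tail = c;
	cl->room += CHUNK_SIZE;
	return 0;
}

/* Copy len bytes, spilling into new chunks whenever the tail fills up. */
static int chunklist_append(struct chunklist *cl, const void *buf, size_t len)
{
	const unsigned char *p = buf;

	while (cl->room < len) {
		size_t bit = cl->room;	/* what still fits in the tail */

		if (bit)
			memcpy(cl->tail->data + cl->length % CHUNK_SIZE, p, bit);
		cl->length += bit;
		cl->room -= bit;
		p += bit;
		len -= bit;
		if (chunklist_addchunk(cl))
			return -1;
	}
	if (len)
		memcpy(cl->tail->data + cl->length % CHUNK_SIZE, p, len);
	cl->length += len;
	cl->room -= len;
	return 0;
}

static void chunklist_release(struct chunklist *cl)
{
	while (cl->head) {
		struct chunk *c = cl->head;

		cl->head = c->next;
		free(c);
	}
	cl->tail = NULL;
}
```

Appending 12 bytes to an empty list with 8-byte chunks leaves two chunks, a length of 12, and 4 bytes of room in the tail.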
|
54
fs/ceph/pagelist.h
Normal file
@ -0,0 +1,54 @@
|
||||
#ifndef __FS_CEPH_PAGELIST_H
|
||||
#define __FS_CEPH_PAGELIST_H
|
||||
|
||||
#include <linux/list.h>
|
||||
|
||||
struct ceph_pagelist {
|
||||
struct list_head head;
|
||||
void *mapped_tail;
|
||||
size_t length;
|
||||
size_t room;
|
||||
};
|
||||
|
||||
static inline void ceph_pagelist_init(struct ceph_pagelist *pl)
|
||||
{
|
||||
INIT_LIST_HEAD(&pl->head);
|
||||
pl->mapped_tail = NULL;
|
||||
pl->length = 0;
|
||||
pl->room = 0;
|
||||
}
|
||||
extern int ceph_pagelist_release(struct ceph_pagelist *pl);
|
||||
|
||||
extern int ceph_pagelist_append(struct ceph_pagelist *pl, void *d, size_t l);
|
||||
|
||||
static inline int ceph_pagelist_encode_64(struct ceph_pagelist *pl, u64 v)
|
||||
{
|
||||
__le64 ev = cpu_to_le64(v);
|
||||
return ceph_pagelist_append(pl, &ev, sizeof(ev));
|
||||
}
|
||||
static inline int ceph_pagelist_encode_32(struct ceph_pagelist *pl, u32 v)
|
||||
{
|
||||
__le32 ev = cpu_to_le32(v);
|
||||
return ceph_pagelist_append(pl, &ev, sizeof(ev));
|
||||
}
|
||||
static inline int ceph_pagelist_encode_16(struct ceph_pagelist *pl, u16 v)
|
||||
{
|
||||
__le16 ev = cpu_to_le16(v);
|
||||
return ceph_pagelist_append(pl, &ev, sizeof(ev));
|
||||
}
|
||||
static inline int ceph_pagelist_encode_8(struct ceph_pagelist *pl, u8 v)
|
||||
{
|
||||
return ceph_pagelist_append(pl, &v, 1);
|
||||
}
|
||||
static inline int ceph_pagelist_encode_string(struct ceph_pagelist *pl,
|
||||
char *s, size_t len)
|
||||
{
|
||||
int ret = ceph_pagelist_encode_32(pl, len);
|
||||
if (ret)
|
||||
return ret;
|
||||
if (len)
|
||||
return ceph_pagelist_append(pl, s, len);
|
||||
return 0;
|
||||
}
|
||||
|
||||
#endif
|
374
fs/ceph/rados.h
Normal file
@ -0,0 +1,374 @@
|
||||
#ifndef __RADOS_H
|
||||
#define __RADOS_H
|
||||
|
||||
/*
|
||||
* Data types for the Ceph distributed object storage layer RADOS
|
||||
* (Reliable Autonomic Distributed Object Store).
|
||||
*/
|
||||
|
||||
#include "msgr.h"
|
||||
|
||||
/*
|
||||
* osdmap encoding versions
|
||||
*/
|
||||
#define CEPH_OSDMAP_INC_VERSION 4
|
||||
#define CEPH_OSDMAP_VERSION 4
|
||||
|
||||
/*
|
||||
* fs id
|
||||
*/
|
||||
struct ceph_fsid {
|
||||
unsigned char fsid[16];
|
||||
};
|
||||
|
||||
static inline int ceph_fsid_compare(const struct ceph_fsid *a,
|
||||
const struct ceph_fsid *b)
|
||||
{
|
||||
return memcmp(a, b, sizeof(*a));
|
||||
}
|
||||
|
||||
/*
|
||||
* ino, object, etc.
|
||||
*/
|
||||
typedef __le64 ceph_snapid_t;
|
||||
#define CEPH_SNAPDIR ((__u64)(-1)) /* reserved for hidden .snap dir */
|
||||
#define CEPH_NOSNAP ((__u64)(-2)) /* "head", "live" revision */
|
||||
#define CEPH_MAXSNAP ((__u64)(-3)) /* largest valid snapid */
|
||||
|
||||
struct ceph_timespec {
|
||||
__le32 tv_sec;
|
||||
__le32 tv_nsec;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
|
||||
/*
|
||||
* object layout - how objects are mapped into PGs
|
||||
*/
|
||||
#define CEPH_OBJECT_LAYOUT_HASH 1
|
||||
#define CEPH_OBJECT_LAYOUT_LINEAR 2
|
||||
#define CEPH_OBJECT_LAYOUT_HASHINO 3
|
||||
|
||||
/*
|
||||
* pg layout -- how PGs are mapped onto (sets of) OSDs
|
||||
*/
|
||||
#define CEPH_PG_LAYOUT_CRUSH 0
|
||||
#define CEPH_PG_LAYOUT_HASH 1
|
||||
#define CEPH_PG_LAYOUT_LINEAR 2
|
||||
#define CEPH_PG_LAYOUT_HYBRID 3
|
||||
|
||||
|
||||
/*
|
||||
* placement group.
|
||||
* we encode this into one __le64.
|
||||
*/
|
||||
struct ceph_pg {
|
||||
__le16 preferred; /* preferred primary osd */
|
||||
__le16 ps; /* placement seed */
|
||||
__le32 pool; /* object pool */
|
||||
} __attribute__ ((packed));
|
||||
|
||||
/*
|
||||
* pg_pool is a set of pgs storing a pool of objects
|
||||
*
|
||||
* pg_num -- base number of pseudorandomly placed pgs
|
||||
*
|
||||
* pgp_num -- effective number when calculating pg placement. this
|
||||
* is used for pg_num increases. new pgs result in data being "split"
|
||||
* into new pgs. for this to proceed smoothly, new pgs are initially
|
||||
* colocated with their parents; that is, pgp_num doesn't increase
|
||||
* until the new pgs have successfully split. only _then_ are the new
|
||||
* pgs placed independently.
|
||||
*
|
||||
* lpg_num -- localized pg count (per device). replicas are randomly
|
||||
* selected.
|
||||
*
|
||||
* lpgp_num -- as above.
|
||||
*/
|
||||
#define CEPH_PG_TYPE_REP 1
|
||||
#define CEPH_PG_TYPE_RAID4 2
|
||||
#define CEPH_PG_POOL_VERSION 2
|
||||
struct ceph_pg_pool {
|
||||
__u8 type; /* CEPH_PG_TYPE_* */
|
||||
__u8 size; /* number of osds in each pg */
|
||||
__u8 crush_ruleset; /* crush placement rule */
|
||||
__u8 object_hash; /* hash mapping object name to ps */
|
||||
__le32 pg_num, pgp_num; /* number of pg's */
|
||||
__le32 lpg_num, lpgp_num; /* number of localized pg's */
|
||||
__le32 last_change; /* most recent epoch changed */
|
||||
__le64 snap_seq; /* seq for per-pool snapshot */
|
||||
__le32 snap_epoch; /* epoch of last snap */
|
||||
__le32 num_snaps;
|
||||
__le32 num_removed_snap_intervals;
|
||||
__le64 uid;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
/*
|
||||
* stable_mod func is used to control number of placement groups.
|
||||
* similar to straight-up modulo, but produces a stable mapping as b
|
||||
* increases over time. b is the number of bins, and bmask is the
|
||||
* containing power of 2 minus 1.
|
||||
*
|
||||
* b <= bmask and bmask=(2**n)-1
|
||||
* e.g., b=12 -> bmask=15, b=123 -> bmask=127
|
||||
*/
|
||||
static inline int ceph_stable_mod(int x, int b, int bmask)
|
||||
{
|
||||
if ((x & bmask) < b)
|
||||
return x & bmask;
|
||||
else
|
||||
return x & (bmask >> 1);
|
||||
}
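A small worked example of the stable-mod behaviour described above, written as a userspace copy of the function. The name `stable_mod` and the sample values are illustrative only.

```c
/*
 * Userspace copy of the helper above: map x into [0, b) such that
 * growing b moves values only into the newly added bins.
 */
static int stable_mod(int x, int b, int bmask)
{
	if ((x & bmask) < b)
		return x & bmask;
	return x & (bmask >> 1);
}
```

With b=12 and bmask=15, x=13 lands in bin 13 & 7 = 5 because bin 13 does not exist yet. Once b grows to 14, the same x maps to bin 13, while x=5 stays in bin 5. Only values destined for a new bin move, and they move out of the bin obtained by dropping the top mask bit, which matches the pg "split" description in the pg_pool comment.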
|
||||
|
||||
/*
|
||||
* object layout - how a given object should be stored.
|
||||
*/
|
||||
struct ceph_object_layout {
|
||||
struct ceph_pg ol_pgid; /* raw pg, with _full_ ps precision. */
|
||||
__le32 ol_stripe_unit; /* for per-object parity, if any */
|
||||
} __attribute__ ((packed));
|
||||
|
||||
/*
|
||||
* compound epoch+version, used by storage layer to serialize mutations
|
||||
*/
|
||||
struct ceph_eversion {
|
||||
__le32 epoch;
|
||||
__le64 version;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
/*
|
||||
* osd map bits
|
||||
*/
|
||||
|
||||
/* status bits */
|
||||
#define CEPH_OSD_EXISTS 1
|
||||
#define CEPH_OSD_UP 2
|
||||
|
||||
/* osd weights. fixed point value: 0x10000 == 1.0 ("in"), 0 == "out" */
|
||||
#define CEPH_OSD_IN 0x10000
|
||||
#define CEPH_OSD_OUT 0
|
||||
|
||||
|
||||
/*
|
||||
* osd map flag bits
|
||||
*/
|
||||
#define CEPH_OSDMAP_NEARFULL (1<<0) /* sync writes (near ENOSPC) */
|
||||
#define CEPH_OSDMAP_FULL (1<<1) /* no data writes (ENOSPC) */
|
||||
#define CEPH_OSDMAP_PAUSERD (1<<2) /* pause all reads */
|
||||
#define CEPH_OSDMAP_PAUSEWR (1<<3) /* pause all writes */
|
||||
#define CEPH_OSDMAP_PAUSEREC (1<<4) /* pause recovery */
|
||||
|
||||
/*
|
||||
* osd ops
|
||||
*/
|
||||
#define CEPH_OSD_OP_MODE 0xf000
|
||||
#define CEPH_OSD_OP_MODE_RD 0x1000
|
||||
#define CEPH_OSD_OP_MODE_WR 0x2000
|
||||
#define CEPH_OSD_OP_MODE_RMW 0x3000
|
||||
#define CEPH_OSD_OP_MODE_SUB 0x4000
|
||||
|
||||
#define CEPH_OSD_OP_TYPE 0x0f00
|
||||
#define CEPH_OSD_OP_TYPE_LOCK 0x0100
|
||||
#define CEPH_OSD_OP_TYPE_DATA 0x0200
|
||||
#define CEPH_OSD_OP_TYPE_ATTR 0x0300
|
||||
#define CEPH_OSD_OP_TYPE_EXEC 0x0400
|
||||
#define CEPH_OSD_OP_TYPE_PG 0x0500
|
||||
|
||||
enum {
|
||||
/** data **/
|
||||
/* read */
|
||||
CEPH_OSD_OP_READ = CEPH_OSD_OP_MODE_RD | CEPH_OSD_OP_TYPE_DATA | 1,
|
||||
CEPH_OSD_OP_STAT = CEPH_OSD_OP_MODE_RD | CEPH_OSD_OP_TYPE_DATA | 2,
|
||||
|
||||
/* fancy read */
|
||||
CEPH_OSD_OP_MASKTRUNC = CEPH_OSD_OP_MODE_RD | CEPH_OSD_OP_TYPE_DATA | 4,
|
||||
|
||||
/* write */
|
||||
CEPH_OSD_OP_WRITE = CEPH_OSD_OP_MODE_WR | CEPH_OSD_OP_TYPE_DATA | 1,
|
||||
CEPH_OSD_OP_WRITEFULL = CEPH_OSD_OP_MODE_WR | CEPH_OSD_OP_TYPE_DATA | 2,
|
||||
CEPH_OSD_OP_TRUNCATE = CEPH_OSD_OP_MODE_WR | CEPH_OSD_OP_TYPE_DATA | 3,
|
||||
CEPH_OSD_OP_ZERO = CEPH_OSD_OP_MODE_WR | CEPH_OSD_OP_TYPE_DATA | 4,
|
||||
CEPH_OSD_OP_DELETE = CEPH_OSD_OP_MODE_WR | CEPH_OSD_OP_TYPE_DATA | 5,
|
||||
|
||||
/* fancy write */
|
||||
CEPH_OSD_OP_APPEND = CEPH_OSD_OP_MODE_WR | CEPH_OSD_OP_TYPE_DATA | 6,
|
||||
CEPH_OSD_OP_STARTSYNC = CEPH_OSD_OP_MODE_WR | CEPH_OSD_OP_TYPE_DATA | 7,
|
||||
CEPH_OSD_OP_SETTRUNC = CEPH_OSD_OP_MODE_WR | CEPH_OSD_OP_TYPE_DATA | 8,
|
||||
CEPH_OSD_OP_TRIMTRUNC = CEPH_OSD_OP_MODE_WR | CEPH_OSD_OP_TYPE_DATA | 9,
|
||||
|
||||
CEPH_OSD_OP_TMAPUP = CEPH_OSD_OP_MODE_RMW | CEPH_OSD_OP_TYPE_DATA | 10,
|
||||
CEPH_OSD_OP_TMAPPUT = CEPH_OSD_OP_MODE_WR | CEPH_OSD_OP_TYPE_DATA | 11,
|
||||
CEPH_OSD_OP_TMAPGET = CEPH_OSD_OP_MODE_RD | CEPH_OSD_OP_TYPE_DATA | 12,
|
||||
|
||||
CEPH_OSD_OP_CREATE = CEPH_OSD_OP_MODE_WR | CEPH_OSD_OP_TYPE_DATA | 13,
|
||||
|
||||
/** attrs **/
|
||||
/* read */
|
||||
CEPH_OSD_OP_GETXATTR = CEPH_OSD_OP_MODE_RD | CEPH_OSD_OP_TYPE_ATTR | 1,
|
||||
CEPH_OSD_OP_GETXATTRS = CEPH_OSD_OP_MODE_RD | CEPH_OSD_OP_TYPE_ATTR | 2,
|
||||
|
||||
/* write */
|
||||
CEPH_OSD_OP_SETXATTR = CEPH_OSD_OP_MODE_WR | CEPH_OSD_OP_TYPE_ATTR | 1,
|
||||
CEPH_OSD_OP_SETXATTRS = CEPH_OSD_OP_MODE_WR | CEPH_OSD_OP_TYPE_ATTR | 2,
|
||||
CEPH_OSD_OP_RESETXATTRS = CEPH_OSD_OP_MODE_WR|CEPH_OSD_OP_TYPE_ATTR | 3,
|
||||
CEPH_OSD_OP_RMXATTR = CEPH_OSD_OP_MODE_WR | CEPH_OSD_OP_TYPE_ATTR | 4,
|
||||
|
||||
/** subop **/
|
||||
CEPH_OSD_OP_PULL = CEPH_OSD_OP_MODE_SUB | 1,
|
||||
CEPH_OSD_OP_PUSH = CEPH_OSD_OP_MODE_SUB | 2,
|
||||
CEPH_OSD_OP_BALANCEREADS = CEPH_OSD_OP_MODE_SUB | 3,
|
||||
CEPH_OSD_OP_UNBALANCEREADS = CEPH_OSD_OP_MODE_SUB | 4,
|
||||
CEPH_OSD_OP_SCRUB = CEPH_OSD_OP_MODE_SUB | 5,
|
||||
|
||||
/** lock **/
|
||||
CEPH_OSD_OP_WRLOCK = CEPH_OSD_OP_MODE_WR | CEPH_OSD_OP_TYPE_LOCK | 1,
|
||||
CEPH_OSD_OP_WRUNLOCK = CEPH_OSD_OP_MODE_WR | CEPH_OSD_OP_TYPE_LOCK | 2,
|
||||
CEPH_OSD_OP_RDLOCK = CEPH_OSD_OP_MODE_WR | CEPH_OSD_OP_TYPE_LOCK | 3,
|
||||
CEPH_OSD_OP_RDUNLOCK = CEPH_OSD_OP_MODE_WR | CEPH_OSD_OP_TYPE_LOCK | 4,
|
||||
CEPH_OSD_OP_UPLOCK = CEPH_OSD_OP_MODE_WR | CEPH_OSD_OP_TYPE_LOCK | 5,
|
||||
CEPH_OSD_OP_DNLOCK = CEPH_OSD_OP_MODE_WR | CEPH_OSD_OP_TYPE_LOCK | 6,
|
||||
|
||||
/** exec **/
|
||||
CEPH_OSD_OP_CALL = CEPH_OSD_OP_MODE_RD | CEPH_OSD_OP_TYPE_EXEC | 1,
|
||||
|
||||
/** pg **/
|
||||
CEPH_OSD_OP_PGLS = CEPH_OSD_OP_MODE_RD | CEPH_OSD_OP_TYPE_PG | 1,
|
||||
};
|
||||
|
||||
static inline int ceph_osd_op_type_lock(int op)
|
||||
{
|
||||
return (op & CEPH_OSD_OP_TYPE) == CEPH_OSD_OP_TYPE_LOCK;
|
||||
}
|
||||
static inline int ceph_osd_op_type_data(int op)
|
||||
{
|
||||
return (op & CEPH_OSD_OP_TYPE) == CEPH_OSD_OP_TYPE_DATA;
|
||||
}
|
||||
static inline int ceph_osd_op_type_attr(int op)
|
||||
{
|
||||
return (op & CEPH_OSD_OP_TYPE) == CEPH_OSD_OP_TYPE_ATTR;
|
||||
}
|
||||
static inline int ceph_osd_op_type_exec(int op)
|
||||
{
|
||||
return (op & CEPH_OSD_OP_TYPE) == CEPH_OSD_OP_TYPE_EXEC;
|
||||
}
|
||||
static inline int ceph_osd_op_type_pg(int op)
|
||||
{
|
||||
return (op & CEPH_OSD_OP_TYPE) == CEPH_OSD_OP_TYPE_PG;
|
||||
}
|
||||
|
||||
static inline int ceph_osd_op_mode_subop(int op)
|
||||
{
|
||||
return (op & CEPH_OSD_OP_MODE) == CEPH_OSD_OP_MODE_SUB;
|
||||
}
|
||||
static inline int ceph_osd_op_mode_read(int op)
|
||||
{
|
||||
return (op & CEPH_OSD_OP_MODE) == CEPH_OSD_OP_MODE_RD;
|
||||
}
|
||||
static inline int ceph_osd_op_mode_modify(int op)
|
||||
{
|
||||
return (op & CEPH_OSD_OP_MODE) == CEPH_OSD_OP_MODE_WR;
|
||||
}
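The op codes above pack a mode nibble, a type nibble and a per-type number into one 16-bit value. The sketch below composes and decodes such a value. The constants are copied from the definitions above, while the helper names (`make_op`, `op_mode`, `op_type`, `op_num`) are hypothetical.

```c
/* Values copied from the CEPH_OSD_OP_MODE_* / CEPH_OSD_OP_TYPE_* defines. */
enum {
	OP_MODE      = 0xf000,
	OP_MODE_RD   = 0x1000,
	OP_MODE_WR   = 0x2000,
	OP_MODE_RMW  = 0x3000,
	OP_TYPE      = 0x0f00,
	OP_TYPE_DATA = 0x0200,
};

/* Compose an op code from its three fields. */
static inline int make_op(int mode, int type, int num)
{
	return mode | type | num;
}

/* Extract the individual fields again. */
static inline int op_mode(int op) { return op & OP_MODE; }
static inline int op_type(int op) { return op & OP_TYPE; }
static inline int op_num(int op)  { return op & 0x00ff; }
```

For example, CEPH_OSD_OP_READ works out to 0x1201 and CEPH_OSD_OP_TMAPUP to 0x320a. Because the mode helpers above compare the whole mode nibble for equality, an RMW op such as TMAPUP matches neither ceph_osd_op_mode_read() nor ceph_osd_op_mode_modify() as written.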
|
||||
|
||||
#define CEPH_OSD_TMAP_HDR 'h'
|
||||
#define CEPH_OSD_TMAP_SET 's'
|
||||
#define CEPH_OSD_TMAP_RM 'r'
|
||||
|
||||
extern const char *ceph_osd_op_name(int op);
|
||||
|
||||
|
||||
/*
|
||||
* osd op flags
|
||||
*
|
||||
* An op may be READ, WRITE, or READ|WRITE.
|
||||
*/
|
||||
enum {
|
||||
CEPH_OSD_FLAG_ACK = 1, /* want (or is) "ack" ack */
|
||||
CEPH_OSD_FLAG_ONNVRAM = 2, /* want (or is) "onnvram" ack */
|
||||
CEPH_OSD_FLAG_ONDISK = 4, /* want (or is) "ondisk" ack */
|
||||
CEPH_OSD_FLAG_RETRY = 8, /* resend attempt */
|
||||
CEPH_OSD_FLAG_READ = 16, /* op may read */
|
||||
CEPH_OSD_FLAG_WRITE = 32, /* op may write */
|
||||
CEPH_OSD_FLAG_ORDERSNAP = 64, /* EOLDSNAPC if snapc is out of order */
|
||||
CEPH_OSD_FLAG_PEERSTAT = 128, /* msg includes osd_peer_stat */
|
||||
CEPH_OSD_FLAG_BALANCE_READS = 256,
|
||||
CEPH_OSD_FLAG_PARALLELEXEC = 512, /* execute op in parallel */
|
||||
CEPH_OSD_FLAG_PGOP = 1024, /* pg op, no object */
|
||||
CEPH_OSD_FLAG_EXEC = 2048, /* op may exec */
|
||||
};
|
||||
|
||||
enum {
|
||||
CEPH_OSD_OP_FLAG_EXCL = 1, /* EXCL object create */
|
||||
};
|
||||
|
||||
#define EOLDSNAPC ERESTART /* ORDERSNAP flag set; writer has old snapc*/
|
||||
#define EBLACKLISTED ESHUTDOWN /* blacklisted */
|
||||
|
||||
/*
|
||||
* an individual object operation. each may be accompanied by some data
|
||||
* payload
|
||||
*/
|
||||
struct ceph_osd_op {
|
||||
__le16 op; /* CEPH_OSD_OP_* */
|
||||
__le32 flags; /* CEPH_OSD_FLAG_* */
|
||||
union {
|
||||
struct {
|
||||
__le64 offset, length;
|
||||
__le64 truncate_size;
|
||||
__le32 truncate_seq;
|
||||
} __attribute__ ((packed)) extent;
|
||||
struct {
|
||||
__le32 name_len;
|
||||
__le32 value_len;
|
||||
} __attribute__ ((packed)) xattr;
|
||||
struct {
|
||||
__u8 class_len;
|
||||
__u8 method_len;
|
||||
__u8 argc;
|
||||
__le32 indata_len;
|
||||
} __attribute__ ((packed)) cls;
|
||||
struct {
|
||||
__le64 cookie, count;
|
||||
} __attribute__ ((packed)) pgls;
|
||||
};
|
||||
__le32 payload_len;
|
||||
} __attribute__ ((packed));
|
||||
|
||||
/*
|
||||
* osd request message header. each request may include multiple
|
||||
* ceph_osd_op object operations.
|
||||
*/
|
||||
struct ceph_osd_request_head {
|
||||
__le32 client_inc; /* client incarnation */
|
||||
struct ceph_object_layout layout; /* pgid */
|
||||
__le32 osdmap_epoch; /* client's osdmap epoch */
|
||||
|
||||
__le32 flags;
|
||||
|
||||
struct ceph_timespec mtime; /* for mutations only */
|
||||
struct ceph_eversion reassert_version; /* if we are replaying op */
|
||||
|
||||
__le32 object_len; /* length of object name */
|
||||
|
||||
__le64 snapid; /* snapid to read */
|
||||
__le64 snap_seq; /* writer's snap context */
|
||||
__le32 num_snaps;
|
||||
|
||||
__le16 num_ops;
|
||||
struct ceph_osd_op ops[]; /* followed by ops[], obj, ticket, snaps */
|
||||
} __attribute__ ((packed));
|
||||
|
||||
struct ceph_osd_reply_head {
|
||||
__le32 client_inc; /* client incarnation */
|
||||
__le32 flags;
|
||||
struct ceph_object_layout layout;
|
||||
__le32 osdmap_epoch;
|
||||
struct ceph_eversion reassert_version; /* for replaying uncommitted */
|
||||
|
||||
__le32 result; /* result code */
|
||||
|
||||
__le32 object_len; /* length of object name */
|
||||
__le32 num_ops;
|
||||
struct ceph_osd_op ops[0]; /* ops[], object */
|
||||
} __attribute__ ((packed));
|
||||
|
||||
|
||||
#endif
|
904
fs/ceph/snap.c
Normal file
@ -0,0 +1,904 @@
|
||||
#include "ceph_debug.h"
|
||||
|
||||
#include <linux/sort.h>
|
||||
|
||||
#include "super.h"
|
||||
#include "decode.h"
|
||||
|
||||
/*
|
||||
* Snapshots in ceph are driven in large part by cooperation from the
|
||||
* client. In contrast to local file systems or file servers that
|
||||
* implement snapshots at a single point in the system, ceph's
|
||||
* distributed access to storage requires clients to help decide
|
||||
* whether a write logically occurs before or after a recently created
|
||||
* snapshot.
|
||||
*
|
||||
* This provides a perfect instantaneous client-wide snapshot. Between
|
||||
* clients, however, snapshots may appear to be applied at slightly
|
||||
* different points in time, depending on delays in delivering the
|
||||
* snapshot notification.
|
||||
*
|
||||
* Snapshots are _not_ file system-wide. Instead, each snapshot
|
||||
* applies to the subdirectory nested beneath some directory. This
|
||||
* effectively divides the hierarchy into multiple "realms," where all
|
||||
* of the files contained by each realm share the same set of
|
||||
* snapshots. An individual realm's snap set contains snapshots
|
||||
* explicitly created on that realm, as well as any snaps in its
|
||||
* parent's snap set _after_ the point at which the parent became its
|
||||
* parent (due to, say, a rename). Similarly, snaps from prior parents
|
||||
* during the time intervals during which they were the parent are included.
|
||||
*
|
||||
* The client is spared most of this detail, fortunately... it must only
|
||||
* maintain a hierarchy of realms reflecting the current parent/child
|
||||
* realm relationship, and for each realm has an explicit list of snaps
|
||||
* inherited from prior parents.
|
||||
*
|
||||
* A snap_realm struct is maintained for realms containing every inode
|
||||
* with an open cap in the system. (The needed snap realm information is
|
||||
* provided by the MDS whenever a cap is issued, i.e., on open.) A 'seq'
|
||||
* version number is used to ensure that as realm parameters change (new
|
||||
* snapshot, new parent, etc.) the client's realm hierarchy is updated.
|
||||
*
|
||||
* The realm hierarchy drives the generation of a 'snap context' for each
|
||||
* realm, which simply lists the resulting set of snaps for the realm. This
|
||||
* is attached to any writes sent to OSDs.
|
||||
*/
|
||||
/*
|
||||
* Unfortunately error handling is a bit mixed here. If we get a snap
|
||||
* update, but don't have enough memory to update our realm hierarchy,
|
||||
* it's not clear what we can do about it (besides complaining to the
|
||||
* console).
|
||||
*/
|
||||
|
||||
|
||||
/*
|
||||
* increase ref count for the realm
|
||||
*
|
||||
* caller must hold snap_rwsem for write.
|
||||
*/
|
||||
void ceph_get_snap_realm(struct ceph_mds_client *mdsc,
|
||||
struct ceph_snap_realm *realm)
|
||||
{
|
||||
dout("get_realm %p %d -> %d\n", realm,
|
||||
atomic_read(&realm->nref), atomic_read(&realm->nref)+1);
|
||||
/*
|
||||
* since we _only_ increment realm refs or empty the empty
|
||||
* list with snap_rwsem held, adjusting the empty list here is
|
||||
* safe. we do need to protect against concurrent empty list
|
||||
* additions, however.
|
||||
*/
|
||||
if (atomic_read(&realm->nref) == 0) {
|
||||
spin_lock(&mdsc->snap_empty_lock);
|
||||
list_del_init(&realm->empty_item);
|
||||
spin_unlock(&mdsc->snap_empty_lock);
|
||||
}
|
||||
|
||||
atomic_inc(&realm->nref);
|
||||
}
|
||||
|
||||
static void __insert_snap_realm(struct rb_root *root,
|
||||
struct ceph_snap_realm *new)
|
||||
{
|
||||
struct rb_node **p = &root->rb_node;
|
||||
struct rb_node *parent = NULL;
|
||||
struct ceph_snap_realm *r = NULL;
|
||||
|
||||
while (*p) {
|
||||
parent = *p;
|
||||
r = rb_entry(parent, struct ceph_snap_realm, node);
|
||||
if (new->ino < r->ino)
|
||||
p = &(*p)->rb_left;
|
||||
else if (new->ino > r->ino)
|
||||
p = &(*p)->rb_right;
|
||||
else
|
||||
BUG();
|
||||
}
|
||||
|
||||
rb_link_node(&new->node, parent, p);
|
||||
rb_insert_color(&new->node, root);
|
||||
}
|
||||
|
||||
/*
|
||||
* create the realm rooted at @ino and insert it into the realm tree.
* the ref count starts at zero (the tree does not take a ref).
|
||||
*
|
||||
* caller must hold snap_rwsem for write.
|
||||
*/
|
||||
static struct ceph_snap_realm *ceph_create_snap_realm(
|
||||
struct ceph_mds_client *mdsc,
|
||||
u64 ino)
|
||||
{
|
||||
struct ceph_snap_realm *realm;
|
||||
|
||||
realm = kzalloc(sizeof(*realm), GFP_NOFS);
|
||||
if (!realm)
|
||||
return ERR_PTR(-ENOMEM);
|
||||
|
||||
atomic_set(&realm->nref, 0); /* tree does not take a ref */
|
||||
realm->ino = ino;
|
||||
INIT_LIST_HEAD(&realm->children);
|
||||
INIT_LIST_HEAD(&realm->child_item);
|
||||
INIT_LIST_HEAD(&realm->empty_item);
|
||||
INIT_LIST_HEAD(&realm->inodes_with_caps);
|
||||
spin_lock_init(&realm->inodes_with_caps_lock);
|
||||
__insert_snap_realm(&mdsc->snap_realms, realm);
|
||||
dout("create_snap_realm %llx %p\n", realm->ino, realm);
|
||||
return realm;
|
||||
}
|
||||
|
||||
/*
|
||||
* lookup the realm rooted at @ino.
|
||||
*
|
||||
* caller must hold snap_rwsem for write.
|
||||
*/
|
||||
struct ceph_snap_realm *ceph_lookup_snap_realm(struct ceph_mds_client *mdsc,
|
||||
u64 ino)
|
||||
{
|
||||
struct rb_node *n = mdsc->snap_realms.rb_node;
|
||||
struct ceph_snap_realm *r;
|
||||
|
||||
while (n) {
|
||||
r = rb_entry(n, struct ceph_snap_realm, node);
|
||||
if (ino < r->ino)
|
||||
n = n->rb_left;
|
||||
else if (ino > r->ino)
|
||||
n = n->rb_right;
|
||||
else {
|
||||
dout("lookup_snap_realm %llx %p\n", r->ino, r);
|
||||
return r;
|
||||
}
|
||||
}
|
||||
return NULL;
|
||||
}
|
||||
|
||||
static void __put_snap_realm(struct ceph_mds_client *mdsc,
|
||||
struct ceph_snap_realm *realm);
|
||||
|
||||
/*
|
||||
* called with snap_rwsem (write)
|
||||
*/
|
||||
static void __destroy_snap_realm(struct ceph_mds_client *mdsc,
|
||||
struct ceph_snap_realm *realm)
|
||||
{
|
||||
dout("__destroy_snap_realm %p %llx\n", realm, realm->ino);
|
||||
|
||||
rb_erase(&realm->node, &mdsc->snap_realms);
|
||||
|
||||
if (realm->parent) {
|
||||
list_del_init(&realm->child_item);
|
||||
__put_snap_realm(mdsc, realm->parent);
|
||||
}
|
||||
|
||||
kfree(realm->prior_parent_snaps);
|
||||
kfree(realm->snaps);
|
||||
ceph_put_snap_context(realm->cached_context);
|
||||
kfree(realm);
|
||||
}
|
||||
|
||||
/*
|
||||
* caller holds snap_rwsem (write)
|
||||
*/
|
||||
static void __put_snap_realm(struct ceph_mds_client *mdsc,
|
||||
struct ceph_snap_realm *realm)
|
||||
{
|
||||
dout("__put_snap_realm %llx %p %d -> %d\n", realm->ino, realm,
|
||||
atomic_read(&realm->nref), atomic_read(&realm->nref)-1);
|
||||
if (atomic_dec_and_test(&realm->nref))
|
||||
__destroy_snap_realm(mdsc, realm);
|
||||
}
|
||||
|
||||
/*
|
||||
* caller needn't hold any locks
|
||||
*/
|
||||
void ceph_put_snap_realm(struct ceph_mds_client *mdsc,
|
||||
struct ceph_snap_realm *realm)
|
||||
{
|
||||
dout("put_snap_realm %llx %p %d -> %d\n", realm->ino, realm,
|
||||
atomic_read(&realm->nref), atomic_read(&realm->nref)-1);
|
||||
if (!atomic_dec_and_test(&realm->nref))
|
||||
return;
|
||||
|
||||
if (down_write_trylock(&mdsc->snap_rwsem)) {
|
||||
__destroy_snap_realm(mdsc, realm);
|
||||
up_write(&mdsc->snap_rwsem);
|
||||
} else {
|
||||
spin_lock(&mdsc->snap_empty_lock);
|
||||
list_add(&realm->empty_item, &mdsc->snap_empty);
|
||||
spin_unlock(&mdsc->snap_empty_lock);
|
||||
}
|
||||
}
|
||||
|
||||
/*
|
||||
* Clean up any realms whose ref counts have dropped to zero. Note
|
||||
* that this does not include realms that were created but not yet
|
||||
* used.
|
||||
*
|
||||
* Called under snap_rwsem (write)
|
||||
*/
|
||||
static void __cleanup_empty_realms(struct ceph_mds_client *mdsc)
|
||||
{
|
||||
struct ceph_snap_realm *realm;
|
||||
|
||||
spin_lock(&mdsc->snap_empty_lock);
|
||||
while (!list_empty(&mdsc->snap_empty)) {
|
||||
realm = list_first_entry(&mdsc->snap_empty,
|
||||
struct ceph_snap_realm, empty_item);
|
||||
list_del(&realm->empty_item);
|
||||
spin_unlock(&mdsc->snap_empty_lock);
|
||||
__destroy_snap_realm(mdsc, realm);
|
||||
spin_lock(&mdsc->snap_empty_lock);
|
||||
}
|
||||
spin_unlock(&mdsc->snap_empty_lock);
|
||||
}
|
||||
|
||||
void ceph_cleanup_empty_realms(struct ceph_mds_client *mdsc)
|
||||
{
|
||||
down_write(&mdsc->snap_rwsem);
|
||||
__cleanup_empty_realms(mdsc);
|
||||
up_write(&mdsc->snap_rwsem);
|
||||
}
|
||||
|
||||
/*
|
||||
* adjust the parent realm of a given @realm. adjust child list, and parent
|
||||
* pointers, and ref counts appropriately.
|
||||
*
|
||||
* return 1 if parent was changed, 0 if unchanged, <0 on error.
|
||||
*
|
||||
* caller must hold snap_rwsem for write.
|
||||
*/
|
||||
static int adjust_snap_realm_parent(struct ceph_mds_client *mdsc,
|
||||
struct ceph_snap_realm *realm,
|
||||
u64 parentino)
|
||||
{
|
||||
struct ceph_snap_realm *parent;
|
||||
|
||||
if (realm->parent_ino == parentino)
|
||||
return 0;
|
||||
|
||||
parent = ceph_lookup_snap_realm(mdsc, parentino);
|
||||
if (!parent) {
|
||||
parent = ceph_create_snap_realm(mdsc, parentino);
|
||||
if (IS_ERR(parent))
|
||||
return PTR_ERR(parent);
|
||||
}
|
||||
dout("adjust_snap_realm_parent %llx %p: %llx %p -> %llx %p\n",
|
||||
realm->ino, realm, realm->parent_ino, realm->parent,
|
||||
parentino, parent);
|
||||
if (realm->parent) {
|
||||
list_del_init(&realm->child_item);
|
||||
ceph_put_snap_realm(mdsc, realm->parent);
|
||||
}
|
||||
realm->parent_ino = parentino;
|
||||
realm->parent = parent;
|
||||
ceph_get_snap_realm(mdsc, parent);
|
||||
list_add(&realm->child_item, &parent->children);
|
||||
return 1;
|
||||
}
|
||||
|
||||
|
||||
static int cmpu64_rev(const void *a, const void *b)
|
||||
{
|
||||
if (*(u64 *)a < *(u64 *)b)
|
||||
return 1;
|
||||
if (*(u64 *)a > *(u64 *)b)
|
||||
return -1;
|
||||
return 0;
|
||||
}
|
||||
|
||||
/*
|
||||
* build the snap context for a given realm.
|
||||
*/
|
||||
static int build_snap_context(struct ceph_snap_realm *realm)
|
||||
{
|
||||
struct ceph_snap_realm *parent = realm->parent;
|
||||
struct ceph_snap_context *snapc;
|
||||
int err = 0;
|
||||
int i;
|
||||
int num = realm->num_prior_parent_snaps + realm->num_snaps;
|
||||
|
||||
/*
|
||||
* build parent context, if it hasn't been built.
|
||||
* conservatively estimate that all parent snaps might be
|
||||
* included by us.
|
||||
*/
|
||||
if (parent) {
|
||||
if (!parent->cached_context) {
|
||||
err = build_snap_context(parent);
|
||||
if (err)
|
||||
goto fail;
|
||||
}
|
||||
num += parent->cached_context->num_snaps;
|
||||
}
|
||||
|
||||
/* do i actually need to update? not if my context seq
|
||||
matches realm seq, and my parent's does too. (this works
|
||||
because rebuild_snap_realms() works _downward_ in
|
||||
hierarchy after each update.) */
|
||||
if (realm->cached_context &&
|
||||
realm->cached_context->seq <= realm->seq &&
|
||||
(!parent ||
|
||||
realm->cached_context->seq <= parent->cached_context->seq)) {
|
||||
dout("build_snap_context %llx %p: %p seq %lld (%d snaps)"
|
||||
" (unchanged)\n",
|
||||
realm->ino, realm, realm->cached_context,
|
||||
realm->cached_context->seq,
|
||||
realm->cached_context->num_snaps);
|
||||
return 0;
|
||||
}
|
||||
|
||||
/* alloc new snap context */
|
||||
err = -ENOMEM;
|
||||
if (num > ULONG_MAX / sizeof(u64) - sizeof(*snapc))
|
||||
goto fail;
|
||||
snapc = kzalloc(sizeof(*snapc) + num*sizeof(u64), GFP_NOFS);
|
||||
if (!snapc)
|
||||
goto fail;
|
||||
atomic_set(&snapc->nref, 1);
|
||||
|
||||
/* build (reverse sorted) snap vector */
|
||||
num = 0;
|
||||
snapc->seq = realm->seq;
|
||||
if (parent) {
|
||||
/* include any of parent's snaps occurring _after_ my
|
||||
parent became my parent */
|
||||
for (i = 0; i < parent->cached_context->num_snaps; i++)
|
||||
if (parent->cached_context->snaps[i] >=
|
||||
realm->parent_since)
|
||||
snapc->snaps[num++] =
|
||||
parent->cached_context->snaps[i];
|
||||
if (parent->cached_context->seq > snapc->seq)
|
||||
snapc->seq = parent->cached_context->seq;
|
||||
}
|
||||
memcpy(snapc->snaps + num, realm->snaps,
|
||||
sizeof(u64)*realm->num_snaps);
|
||||
num += realm->num_snaps;
|
||||
memcpy(snapc->snaps + num, realm->prior_parent_snaps,
|
||||
sizeof(u64)*realm->num_prior_parent_snaps);
|
||||
num += realm->num_prior_parent_snaps;
|
||||
|
||||
sort(snapc->snaps, num, sizeof(u64), cmpu64_rev, NULL);
|
||||
snapc->num_snaps = num;
|
||||
dout("build_snap_context %llx %p: %p seq %lld (%d snaps)\n",
|
||||
realm->ino, realm, snapc, snapc->seq, snapc->num_snaps);
|
||||
|
||||
if (realm->cached_context)
|
||||
ceph_put_snap_context(realm->cached_context);
|
||||
realm->cached_context = snapc;
|
||||
return 0;
|
||||
|
||||
fail:
|
||||
/*
|
||||
* if we fail, clear old (incorrect) cached_context... hopefully
|
||||
* we'll have better luck building it later
|
||||
*/
|
||||
if (realm->cached_context) {
|
||||
ceph_put_snap_context(realm->cached_context);
|
||||
realm->cached_context = NULL;
|
||||
}
|
||||
pr_err("build_snap_context %llx %p fail %d\n", realm->ino,
|
||||
realm, err);
|
||||
return err;
|
||||
}
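The snap vector assembly in build_snap_context() can be sketched in userspace as follows: take the parent's snaps at or after parent_since, add the realm's own snaps and the prior-parent snaps, then sort in descending order. This is an illustration under assumptions, not the kernel code. `merge_snaps` and `cmp_u64_rev` are hypothetical names, and qsort() stands in for the kernel's sort().

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Reverse comparator: larger snap ids sort first. */
static int cmp_u64_rev(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

	if (x < y)
		return 1;
	if (x > y)
		return -1;
	return 0;
}

/*
 * Fill out[] with the realm's effective snaps, newest first, and
 * return how many were written.  out must have room for
 * n_parent + n_own + n_prior entries.
 */
static size_t merge_snaps(uint64_t *out,
			  const uint64_t *parent, size_t n_parent,
			  uint64_t parent_since,
			  const uint64_t *own, size_t n_own,
			  const uint64_t *prior, size_t n_prior)
{
	size_t i, num = 0;

	/* only parent snaps taken since the parent became our parent */
	for (i = 0; i < n_parent; i++)
		if (parent[i] >= parent_since)
			out[num++] = parent[i];
	if (n_own) {
		memcpy(out + num, own, n_own * sizeof(*own));
		num += n_own;
	}
	if (n_prior) {
		memcpy(out + num, prior, n_prior * sizeof(*prior));
		num += n_prior;
	}
	qsort(out, num, sizeof(*out), cmp_u64_rev);
	return num;
}
```

With parent snaps {10, 7, 3}, parent_since 5, own snaps {8, 4} and prior-parent snaps {2}, the result is {10, 8, 7, 4, 2}. Parent snap 3 is dropped because it predates the current parent relationship.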
|
||||
|
||||
/*
|
||||
* rebuild snap context for the given realm and all of its children.
|
||||
*/
|
||||
static void rebuild_snap_realms(struct ceph_snap_realm *realm)
|
||||
{
|
||||
struct ceph_snap_realm *child;
|
||||
|
||||
dout("rebuild_snap_realms %llx %p\n", realm->ino, realm);
|
||||
build_snap_context(realm);
|
||||
|
||||
list_for_each_entry(child, &realm->children, child_item)
|
||||
rebuild_snap_realms(child);
|
||||
}
|
||||
|
||||
|
||||
/*
|
||||
* helper to allocate and decode an array of snapids. free prior
|
||||
* instance, if any.
|
||||
*/
|
||||
static int dup_array(u64 **dst, __le64 *src, int num)
|
||||
{
|
||||
int i;
|
||||
|
||||
kfree(*dst);
|
||||
if (num) {
|
||||
*dst = kcalloc(num, sizeof(u64), GFP_NOFS);
|
||||
if (!*dst)
|
||||
return -ENOMEM;
|
||||
for (i = 0; i < num; i++)
|
||||
(*dst)[i] = get_unaligned_le64(src + i);
|
||||
} else {
|
||||
*dst = NULL;
|
||||
}
|
||||
return 0;
|
||||
}


/*
 * When a snapshot is applied, the size/mtime inode metadata is queued
 * in a ceph_cap_snap (one for each snapshot) until writeback
 * completes and the metadata can be flushed back to the MDS.
 *
 * However, if a (sync) write is currently in-progress when we apply
 * the snapshot, we have to wait until the write succeeds or fails
 * (and a final size/mtime is known). In this case
 * cap_snap->writing is set to 1, and the cap_snap is said to be
 * "pending." When the write finishes, we __ceph_finish_cap_snap().
 *
 * Caller must hold snap_rwsem for read (i.e., the realm topology won't
 * change).
 */
void ceph_queue_cap_snap(struct ceph_inode_info *ci,
			 struct ceph_snap_context *snapc)
{
	struct inode *inode = &ci->vfs_inode;
	struct ceph_cap_snap *capsnap;
	int used;

	capsnap = kzalloc(sizeof(*capsnap), GFP_NOFS);
	if (!capsnap) {
		pr_err("ENOMEM allocating ceph_cap_snap on %p\n", inode);
		return;
	}

	spin_lock(&inode->i_lock);
	used = __ceph_caps_used(ci);
	if (__ceph_have_pending_cap_snap(ci)) {
		/* there is no point in queuing multiple "pending" cap_snaps,
		   as no new writes are allowed to start when pending, so any
		   writes in progress now were started before the previous
		   cap_snap. lucky us. */
		dout("queue_cap_snap %p snapc %p seq %llu used %d"
		     " already pending\n", inode, snapc, snapc->seq, used);
		kfree(capsnap);
	} else if (ci->i_wrbuffer_ref_head || (used & CEPH_CAP_FILE_WR)) {
		igrab(inode);

		atomic_set(&capsnap->nref, 1);
		capsnap->ci = ci;
		INIT_LIST_HEAD(&capsnap->ci_item);
		INIT_LIST_HEAD(&capsnap->flushing_item);

		capsnap->follows = snapc->seq - 1;
		capsnap->context = ceph_get_snap_context(snapc);
		capsnap->issued = __ceph_caps_issued(ci, NULL);
		capsnap->dirty = __ceph_caps_dirty(ci);

		capsnap->mode = inode->i_mode;
		capsnap->uid = inode->i_uid;
		capsnap->gid = inode->i_gid;

		/* fixme? */
		capsnap->xattr_blob = NULL;
		capsnap->xattr_len = 0;

		/* dirty page count moved from _head to this cap_snap;
		   all subsequent page dirties occur _after_ this
		   snapshot. */
		capsnap->dirty_pages = ci->i_wrbuffer_ref_head;
		ci->i_wrbuffer_ref_head = 0;
		ceph_put_snap_context(ci->i_head_snapc);
		ci->i_head_snapc = NULL;
		list_add_tail(&capsnap->ci_item, &ci->i_cap_snaps);

		if (used & CEPH_CAP_FILE_WR) {
			dout("queue_cap_snap %p cap_snap %p snapc %p"
			     " seq %llu used WR, now pending\n", inode,
			     capsnap, snapc, snapc->seq);
			capsnap->writing = 1;
		} else {
			/* note mtime, size NOW. */
			__ceph_finish_cap_snap(ci, capsnap);
		}
	} else {
		dout("queue_cap_snap %p nothing dirty|writing\n", inode);
		kfree(capsnap);
	}

	spin_unlock(&inode->i_lock);
}
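
Editor's sketch (not part of the commit): the body of ceph_queue_cap_snap() is a three-way decision. The snippet below restates only that branching as a pure function, so the cases are easy to compare. The enum and the name classify_cap_snap() are hypothetical; the kernel function returns void and acts on the inode directly.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome names, one per branch of the kernel function. */
enum queue_outcome {
	QUEUE_NOTHING,		/* nothing dirty or writing: cap_snap is freed */
	QUEUE_ALREADY_PENDING,	/* an earlier pending cap_snap covers the writes */
	QUEUE_PENDING,		/* queued with writing = 1, finished later */
	QUEUE_FINISHED,		/* queued and size/mtime recorded immediately */
};

enum queue_outcome classify_cap_snap(bool have_pending, int dirty_head_pages,
				     bool file_wr_in_use)
{
	if (have_pending)
		return QUEUE_ALREADY_PENDING;
	if (dirty_head_pages != 0 || file_wr_in_use)
		return file_wr_in_use ? QUEUE_PENDING : QUEUE_FINISHED;
	return QUEUE_NOTHING;
}
```

For example, an inode with dirty buffered pages but no sync write in flight maps to QUEUE_FINISHED: the cap_snap is queued and its size and mtime are taken at once.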

/*
 * Finalize the size and mtime for a cap_snap; that is, settle on the
 * final values to be used for the snapshot and flushed back to the MDS.
 *
 * If capsnap can now be flushed, add to snap_flush list, and return 1.
 *
 * Caller must hold i_lock.
 */
int __ceph_finish_cap_snap(struct ceph_inode_info *ci,
			    struct ceph_cap_snap *capsnap)
{
	struct inode *inode = &ci->vfs_inode;
	struct ceph_mds_client *mdsc = &ceph_client(inode->i_sb)->mdsc;

	BUG_ON(capsnap->writing);
	capsnap->size = inode->i_size;
	capsnap->mtime = inode->i_mtime;
	capsnap->atime = inode->i_atime;
	capsnap->ctime = inode->i_ctime;
	capsnap->time_warp_seq = ci->i_time_warp_seq;
	if (capsnap->dirty_pages) {
		dout("finish_cap_snap %p cap_snap %p snapc %p %llu s=%llu "
		     "still has %d dirty pages\n", inode, capsnap,
		     capsnap->context, capsnap->context->seq,
		     capsnap->size, capsnap->dirty_pages);
		return 0;
	}
	dout("finish_cap_snap %p cap_snap %p snapc %p %llu s=%llu clean\n",
	     inode, capsnap, capsnap->context,
	     capsnap->context->seq, capsnap->size);

	spin_lock(&mdsc->snap_flush_lock);
	list_add_tail(&ci->i_snap_flush_item, &mdsc->snap_flush_list);
	spin_unlock(&mdsc->snap_flush_lock);
	return 1; /* caller may want to ceph_flush_snaps */
}


/*
 * Parse and apply a snapblob "snap trace" from the MDS. This specifies
 * the snap realm parameters from a given realm and all of its ancestors,
 * up to the root.
 *
 * Caller must hold snap_rwsem for write.
 */
int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
			   void *p, void *e, bool deletion)
{
	struct ceph_mds_snap_realm *ri; /* encoded */
	__le64 *snaps; /* encoded */
	__le64 *prior_parent_snaps; /* encoded */
	struct ceph_snap_realm *realm;
	int invalidate = 0;
	int err = -ENOMEM;

	dout("update_snap_trace deletion=%d\n", deletion);
more:
	ceph_decode_need(&p, e, sizeof(*ri), bad);
	ri = p;
	p += sizeof(*ri);
	ceph_decode_need(&p, e, sizeof(u64)*(le32_to_cpu(ri->num_snaps) +
			    le32_to_cpu(ri->num_prior_parent_snaps)), bad);
	snaps = p;
	p += sizeof(u64) * le32_to_cpu(ri->num_snaps);
	prior_parent_snaps = p;
	p += sizeof(u64) * le32_to_cpu(ri->num_prior_parent_snaps);

	realm = ceph_lookup_snap_realm(mdsc, le64_to_cpu(ri->ino));
	if (!realm) {
		realm = ceph_create_snap_realm(mdsc, le64_to_cpu(ri->ino));
		if (IS_ERR(realm)) {
			err = PTR_ERR(realm);
			goto fail;
		}
	}

	if (le64_to_cpu(ri->seq) > realm->seq) {
		dout("update_snap_trace updating %llx %p %lld -> %lld\n",
		     realm->ino, realm, realm->seq, le64_to_cpu(ri->seq));
		/*
		 * if the realm seq has changed, queue a cap_snap for every
		 * inode with open caps. we do this _before_ we update
		 * the realm info so that we prepare for writeback under the
		 * _previous_ snap context.
		 *
		 * ...unless it's a snap deletion!
		 */
		if (!deletion) {
			struct ceph_inode_info *ci;
			struct inode *lastinode = NULL;

			spin_lock(&realm->inodes_with_caps_lock);
			list_for_each_entry(ci, &realm->inodes_with_caps,
					    i_snap_realm_item) {
				struct inode *inode = igrab(&ci->vfs_inode);
				if (!inode)
					continue;
				spin_unlock(&realm->inodes_with_caps_lock);
				if (lastinode)
					iput(lastinode);
				lastinode = inode;
				ceph_queue_cap_snap(ci, realm->cached_context);
				spin_lock(&realm->inodes_with_caps_lock);
			}
			spin_unlock(&realm->inodes_with_caps_lock);
			if (lastinode)
				iput(lastinode);
			dout("update_snap_trace cap_snaps queued\n");
		}

	} else {
		dout("update_snap_trace %llx %p seq %lld unchanged\n",
		     realm->ino, realm, realm->seq);
	}

	/* ensure the parent is correct */
	err = adjust_snap_realm_parent(mdsc, realm, le64_to_cpu(ri->parent));
	if (err < 0)
		goto fail;
	invalidate += err;

	if (le64_to_cpu(ri->seq) > realm->seq) {
		/* update realm parameters, snap lists */
		realm->seq = le64_to_cpu(ri->seq);
		realm->created = le64_to_cpu(ri->created);
		realm->parent_since = le64_to_cpu(ri->parent_since);

		realm->num_snaps = le32_to_cpu(ri->num_snaps);
		err = dup_array(&realm->snaps, snaps, realm->num_snaps);
		if (err < 0)
			goto fail;

		realm->num_prior_parent_snaps =
			le32_to_cpu(ri->num_prior_parent_snaps);
		err = dup_array(&realm->prior_parent_snaps, prior_parent_snaps,
				realm->num_prior_parent_snaps);
		if (err < 0)
			goto fail;

		invalidate = 1;
	} else if (!realm->cached_context) {
		invalidate = 1;
	}

	dout("done with %llx %p, invalidated=%d, %p %p\n", realm->ino,
	     realm, invalidate, p, e);

	if (p < e)
		goto more;

	/* invalidate when we reach the _end_ (root) of the trace */
	if (invalidate)
		rebuild_snap_realms(realm);

	__cleanup_empty_realms(mdsc);
	return 0;

bad:
	err = -EINVAL;
fail:
	pr_err("update_snap_trace error %d\n", err);
	return err;
}
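
Editor's sketch (not part of the commit): ceph_update_snap_trace() walks a buffer of variable-length records, checking before each step that the next piece fits between p and e. The snippet below shows that bounds-checked walk in userspace C. The record layout is a simplified, hypothetical one (it is not struct ceph_mds_snap_realm), and count_trace_records() is an invented name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t rd_le32(const unsigned char *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Walk a buffer of simplified, hypothetical records:
 *   8-byte id, 4-byte num_snaps, 4-byte num_prior (little-endian),
 *   then 8 * (num_snaps + num_prior) bytes of snap ids.
 * Returns the number of records, or -1 if a record runs past the end.
 */
int count_trace_records(const unsigned char *p, const unsigned char *e)
{
	int n = 0;

	while (p < e) {
		uint64_t body;

		if ((size_t)(e - p) < 16)
			return -1;	/* header does not fit */
		/* 64-bit arithmetic so the sum of two counts cannot wrap */
		body = 8 * ((uint64_t)rd_le32(p + 8) + rd_le32(p + 12));
		p += 16;
		if ((uint64_t)(e - p) < body)
			return -1;	/* snap arrays do not fit */
		p += body;
		n++;
	}
	return n;
}
```

The sketch only counts records; the kernel code additionally looks up or creates a realm for each one and updates it.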


/*
 * Send any cap_snaps that are queued for flush. Try to carry
 * s_mutex across multiple snap flushes to avoid locking overhead.
 *
 * Caller holds no locks.
 */
static void flush_snaps(struct ceph_mds_client *mdsc)
{
	struct ceph_inode_info *ci;
	struct inode *inode;
	struct ceph_mds_session *session = NULL;

	dout("flush_snaps\n");
	spin_lock(&mdsc->snap_flush_lock);
	while (!list_empty(&mdsc->snap_flush_list)) {
		ci = list_first_entry(&mdsc->snap_flush_list,
				struct ceph_inode_info, i_snap_flush_item);
		inode = &ci->vfs_inode;
		igrab(inode);
		spin_unlock(&mdsc->snap_flush_lock);
		spin_lock(&inode->i_lock);
		__ceph_flush_snaps(ci, &session);
		spin_unlock(&inode->i_lock);
		iput(inode);
		spin_lock(&mdsc->snap_flush_lock);
	}
	spin_unlock(&mdsc->snap_flush_lock);

	if (session) {
		mutex_unlock(&session->s_mutex);
		ceph_put_mds_session(session);
	}
	dout("flush_snaps done\n");
}


/*
 * Handle a snap notification from the MDS.
 *
 * This can take two basic forms: the simplest is just a snap creation
 * or deletion notification on an existing realm. This should update the
 * realm and its children.
 *
 * The more difficult case is realm creation, due to snap creation at a
 * new point in the file hierarchy, or due to a rename that moves a file or
 * directory into another realm.
 */
void ceph_handle_snap(struct ceph_mds_client *mdsc,
		      struct ceph_mds_session *session,
		      struct ceph_msg *msg)
{
	struct super_block *sb = mdsc->client->sb;
	int mds = session->s_mds;
	u64 split;
	int op;
	int trace_len;
	struct ceph_snap_realm *realm = NULL;
	void *p = msg->front.iov_base;
	void *e = p + msg->front.iov_len;
	struct ceph_mds_snap_head *h;
	int num_split_inos, num_split_realms;
	__le64 *split_inos = NULL, *split_realms = NULL;
	int i;
	int locked_rwsem = 0;

	/* decode */
	if (msg->front.iov_len < sizeof(*h))
		goto bad;
	h = p;
	op = le32_to_cpu(h->op);
	split = le64_to_cpu(h->split); /* non-zero if we are splitting an
					* existing realm */
	num_split_inos = le32_to_cpu(h->num_split_inos);
	num_split_realms = le32_to_cpu(h->num_split_realms);
	trace_len = le32_to_cpu(h->trace_len);
	p += sizeof(*h);

	dout("handle_snap from mds%d op %s split %llx tracelen %d\n", mds,
	     ceph_snap_op_name(op), split, trace_len);

	mutex_lock(&session->s_mutex);
	session->s_seq++;
	mutex_unlock(&session->s_mutex);

	down_write(&mdsc->snap_rwsem);
	locked_rwsem = 1;

	if (op == CEPH_SNAP_OP_SPLIT) {
		struct ceph_mds_snap_realm *ri;

		/*
		 * A "split" breaks part of an existing realm off into
		 * a new realm. The MDS provides a list of inodes
		 * (with caps) and child realms that belong to the new
		 * child.
		 */
		split_inos = p;
		p += sizeof(u64) * num_split_inos;
		split_realms = p;
		p += sizeof(u64) * num_split_realms;
		ceph_decode_need(&p, e, sizeof(*ri), bad);
		/* we will peek at realm info here, but will _not_
		 * advance p, as the realm update will occur below in
		 * ceph_update_snap_trace. */
		ri = p;

		realm = ceph_lookup_snap_realm(mdsc, split);
		if (!realm) {
			realm = ceph_create_snap_realm(mdsc, split);
			if (IS_ERR(realm))
				goto out;
		}
		ceph_get_snap_realm(mdsc, realm);

		dout("splitting snap_realm %llx %p\n", realm->ino, realm);
		for (i = 0; i < num_split_inos; i++) {
			struct ceph_vino vino = {
				.ino = le64_to_cpu(split_inos[i]),
				.snap = CEPH_NOSNAP,
			};
			struct inode *inode = ceph_find_inode(sb, vino);
			struct ceph_inode_info *ci;

			if (!inode)
				continue;
			ci = ceph_inode(inode);

			spin_lock(&inode->i_lock);
			if (!ci->i_snap_realm)
				goto skip_inode;
			/*
			 * If this inode belongs to a realm that was
			 * created after our new realm, we experienced
			 * a race (due to another split notification
			 * arriving from a different MDS). So skip
			 * this inode.
			 */
			if (ci->i_snap_realm->created >
			    le64_to_cpu(ri->created)) {
				dout(" leaving %p in newer realm %llx %p\n",
				     inode, ci->i_snap_realm->ino,
				     ci->i_snap_realm);
				goto skip_inode;
			}
			dout(" will move %p to split realm %llx %p\n",
			     inode, realm->ino, realm);
			/*
			 * Remove the inode from the realm's inode
			 * list, but don't add it to the new realm
			 * yet. We don't want the cap_snap to be
			 * queued (again) by ceph_update_snap_trace()
			 * below. Queue it _now_, under the old context.
			 */
			list_del_init(&ci->i_snap_realm_item);
			spin_unlock(&inode->i_lock);

			ceph_queue_cap_snap(ci,
					    ci->i_snap_realm->cached_context);

			iput(inode);
			continue;

skip_inode:
			spin_unlock(&inode->i_lock);
			iput(inode);
		}

		/* we may have taken some of the old realm's children. */
		for (i = 0; i < num_split_realms; i++) {
			struct ceph_snap_realm *child =
				ceph_lookup_snap_realm(mdsc,
					   le64_to_cpu(split_realms[i]));
			if (!child)
				continue;
			adjust_snap_realm_parent(mdsc, child, realm->ino);
		}
	}

	/*
	 * update using the provided snap trace. if we are deleting a
	 * snap, we can avoid queueing cap_snaps.
	 */
	ceph_update_snap_trace(mdsc, p, e,
			       op == CEPH_SNAP_OP_DESTROY);

	if (op == CEPH_SNAP_OP_SPLIT) {
		/*
		 * ok, _now_ add the inodes into the new realm.
		 */
		for (i = 0; i < num_split_inos; i++) {
			struct ceph_vino vino = {
				.ino = le64_to_cpu(split_inos[i]),
				.snap = CEPH_NOSNAP,
			};
			struct inode *inode = ceph_find_inode(sb, vino);
			struct ceph_inode_info *ci;

			if (!inode)
				continue;
			ci = ceph_inode(inode);
			spin_lock(&inode->i_lock);
			if (!ci->i_snap_realm)
				goto split_skip_inode;
			ceph_put_snap_realm(mdsc, ci->i_snap_realm);
			spin_lock(&realm->inodes_with_caps_lock);
			list_add(&ci->i_snap_realm_item,
				 &realm->inodes_with_caps);
			ci->i_snap_realm = realm;
			spin_unlock(&realm->inodes_with_caps_lock);
			ceph_get_snap_realm(mdsc, realm);
split_skip_inode:
			spin_unlock(&inode->i_lock);
			iput(inode);
		}

		/* we took a reference when we created the realm, above */
		ceph_put_snap_realm(mdsc, realm);
	}

	__cleanup_empty_realms(mdsc);

	up_write(&mdsc->snap_rwsem);

	flush_snaps(mdsc);
	return;

bad:
	pr_err("corrupt snap message from mds%d\n", mds);
	ceph_msg_dump(msg);
out:
	if (locked_rwsem)
		up_write(&mdsc->snap_rwsem);
	return;
}

1030
fs/ceph/super.c
Normal file
File diff suppressed because it is too large
901
fs/ceph/super.h
Normal file
@ -0,0 +1,901 @@
#ifndef _FS_CEPH_SUPER_H
#define _FS_CEPH_SUPER_H

#include "ceph_debug.h"

#include <asm/unaligned.h>
#include <linux/backing-dev.h>
#include <linux/completion.h>
#include <linux/exportfs.h>
#include <linux/fs.h>
#include <linux/mempool.h>
#include <linux/pagemap.h>
#include <linux/wait.h>
#include <linux/writeback.h>

#include "types.h"
#include "messenger.h"
#include "msgpool.h"
#include "mon_client.h"
#include "mds_client.h"
#include "osd_client.h"
#include "ceph_fs.h"

/* f_type in struct statfs */
#define CEPH_SUPER_MAGIC 0x00c36400

/* large granularity for statfs utilization stats to facilitate
 * large volume sizes on 32-bit machines. */
#define CEPH_BLOCK_SHIFT 20 /* 1 MB */
#define CEPH_BLOCK (1 << CEPH_BLOCK_SHIFT)
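
Editor's sketch (not part of the commit): the comment above gives the reason for a 1 MB statfs block size. A short arithmetic illustration follows, assuming the block count is held in a 32-bit field; that width is an assumption made for the example, not something the header states. With a shift of 20, a 32-bit block count covers just under 4 PiB. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SHIFT 20	/* 1 MB, same value as CEPH_BLOCK_SHIFT */

/* Convert a byte count to whole 1 MB blocks (rounding down). */
uint64_t bytes_to_blocks(uint64_t bytes)
{
	return bytes >> BLOCK_SHIFT;
}

/* Largest byte size whose block count still fits in 32 bits. */
uint64_t max_bytes_for_32bit_blocks(void)
{
	return (uint64_t)UINT32_MAX << BLOCK_SHIFT;
}
```

By comparison, 4 KB blocks (a shift of 12) under the same assumption would top out just under 16 TiB.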

/*
 * mount options
 */
#define CEPH_OPT_FSID (1<<0)
#define CEPH_OPT_NOSHARE (1<<1) /* don't share client with other sbs */
#define CEPH_OPT_MYIP (1<<2) /* specified my ip */
#define CEPH_OPT_DIRSTAT (1<<4) /* funky `cat dirname` for stats */
#define CEPH_OPT_RBYTES (1<<5) /* dir st_bytes = rbytes */
#define CEPH_OPT_NOCRC (1<<6) /* no data crc on writes */
#define CEPH_OPT_NOASYNCREADDIR (1<<7) /* no dcache readdir */

#define CEPH_OPT_DEFAULT (CEPH_OPT_RBYTES)

#define ceph_set_opt(client, opt) \
	(client)->mount_args->flags |= CEPH_OPT_##opt;
#define ceph_test_opt(client, opt) \
	(!!((client)->mount_args->flags & CEPH_OPT_##opt))
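
Editor's sketch (not part of the commit): ceph_set_opt() and ceph_test_opt() build the flag constant by pasting the short option name onto a prefix. The snippet below shows the same technique in standalone C with hypothetical names (struct args, OPT_*, set_opt, test_opt). As a deliberate difference from the header, the setter is written as a parenthesized expression with no trailing semicolon, so that it stays safe inside an unbraced if/else.

```c
#include <assert.h>

#define OPT_NOSHARE	(1 << 1)	/* illustrative values */
#define OPT_RBYTES	(1 << 5)

struct args {
	int flags;
};

/* Token pasting turns set_opt(a, RBYTES) into a->flags |= OPT_RBYTES. */
#define set_opt(a, opt)		((a)->flags |= OPT_##opt)
/* The double negation normalizes any set bit to exactly 1. */
#define test_opt(a, opt)	(!!((a)->flags & OPT_##opt))
```

A call site then reads set_opt(&a, NOSHARE); with the semicolon supplied by the caller, as for an ordinary function call.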


struct ceph_mount_args {
	int sb_flags;
	int num_mon;
	struct ceph_entity_addr *mon_addr;
	int flags;
	int mount_timeout;
	int osd_idle_ttl;
	int caps_wanted_delay_min, caps_wanted_delay_max;
	struct ceph_fsid fsid;
	struct ceph_entity_addr my_addr;
	int wsize;
	int rsize; /* max readahead */
	int max_readdir; /* max readdir size */
	int congestion_kb; /* max writeback in flight */
	int osd_timeout;
	int osd_keepalive_timeout;
	char *snapdir_name; /* default ".snap" */
	char *name;
	char *secret;
	int cap_release_safety;
};

/*
 * defaults
 */
#define CEPH_MOUNT_TIMEOUT_DEFAULT 60
#define CEPH_OSD_TIMEOUT_DEFAULT 60 /* seconds */
#define CEPH_OSD_KEEPALIVE_DEFAULT 5
#define CEPH_OSD_IDLE_TTL_DEFAULT 60
#define CEPH_MOUNT_RSIZE_DEFAULT (512*1024) /* readahead */

#define CEPH_MSG_MAX_FRONT_LEN (16*1024*1024)
#define CEPH_MSG_MAX_DATA_LEN (16*1024*1024)

#define CEPH_SNAPDIRNAME_DEFAULT ".snap"
#define CEPH_AUTH_NAME_DEFAULT "guest"

/*
 * Delay telling the MDS we no longer want caps, in case we reopen
 * the file. Delay a minimum amount of time, even if we send a cap
 * message for some other reason. Otherwise, take the opportunity to
 * update the mds to avoid sending another message later.
 */
#define CEPH_CAPS_WANTED_DELAY_MIN_DEFAULT 5 /* cap release delay */
#define CEPH_CAPS_WANTED_DELAY_MAX_DEFAULT 60 /* cap release delay */


/* mount state */
enum {
	CEPH_MOUNT_MOUNTING,
	CEPH_MOUNT_MOUNTED,
	CEPH_MOUNT_UNMOUNTING,
	CEPH_MOUNT_UNMOUNTED,
	CEPH_MOUNT_SHUTDOWN,
};

/*
 * subtract jiffies
 */
static inline unsigned long time_sub(unsigned long a, unsigned long b)
{
	BUG_ON(time_after(b, a));
	return (long)a - (long)b;
}
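
Editor's sketch (not part of the commit): time_sub() subtracts two tick counts and asserts that the first is not earlier than the second. The userspace version below shows why this works across counter wraparound. The names tick_after() and tick_sub() are hypothetical; tick_after() follows the usual signed-difference comparison, and the sketch subtracts in unsigned arithmetic, which wraps modulo 2^N by definition, rather than casting each operand to long first. The cast of the difference to long assumes the common two's-complement behaviour.

```c
#include <assert.h>
#include <limits.h>

/* True if tick count a is later than b, even across counter wraparound. */
int tick_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Elapsed ticks from b to a; a must not be earlier than b. */
unsigned long tick_sub(unsigned long a, unsigned long b)
{
	assert(!tick_after(b, a));	/* stands in for BUG_ON() */
	return a - b;			/* unsigned subtraction wraps modulo 2^N */
}
```

For instance, if the counter was 5 ticks before wrapping at b and reads 5 at a, tick_sub(a, b) yields 10.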

/*
 * per-filesystem client state
 *
 * possibly shared by multiple mount points, if they are
 * mounting the same ceph filesystem/cluster.
 */
struct ceph_client {
	struct ceph_fsid fsid;
	bool have_fsid;

	struct mutex mount_mutex; /* serialize mount attempts */
	struct ceph_mount_args *mount_args;

	struct super_block *sb;

	unsigned long mount_state;
	wait_queue_head_t auth_wq;

	int auth_err;

	int min_caps; /* min caps i added */

	struct ceph_messenger *msgr; /* messenger instance */
	struct ceph_mon_client monc;
	struct ceph_mds_client mdsc;
	struct ceph_osd_client osdc;

	/* writeback */
	mempool_t *wb_pagevec_pool;
	struct workqueue_struct *wb_wq;
	struct workqueue_struct *pg_inv_wq;
	struct workqueue_struct *trunc_wq;
	atomic_long_t writeback_count;

	struct backing_dev_info backing_dev_info;

#ifdef CONFIG_DEBUG_FS
	struct dentry *debugfs_monmap;
	struct dentry *debugfs_mdsmap, *debugfs_osdmap;
	struct dentry *debugfs_dir, *debugfs_dentry_lru, *debugfs_caps;
	struct dentry *debugfs_congestion_kb;
	struct dentry *debugfs_bdi;
#endif
};

static inline struct ceph_client *ceph_client(struct super_block *sb)
{
	return sb->s_fs_info;
}


/*
 * File i/o capability. This tracks shared state with the metadata
 * server that allows us to cache or writeback attributes or to read
 * and write data. For any given inode, we should have one or more
 * capabilities, one issued by each metadata server, and our
 * cumulative access is the OR of all issued capabilities.
 *
 * Each cap is referenced by the inode's i_caps rbtree and by per-mds
 * session capability lists.
 */
struct ceph_cap {
	struct ceph_inode_info *ci;
	struct rb_node ci_node; /* per-ci cap tree */
	struct ceph_mds_session *session;
	struct list_head session_caps; /* per-session caplist */
	int mds;
	u64 cap_id; /* unique cap id (mds provided) */
	int issued; /* latest, from the mds */
	int implemented; /* implemented superset of issued (for revocation) */
	int mds_wanted;
	u32 seq, issue_seq, mseq;
	u32 cap_gen; /* active/stale cycle */
	unsigned long last_used;
	struct list_head caps_item;
};

#define CHECK_CAPS_NODELAY 1 /* do not delay any further */
#define CHECK_CAPS_AUTHONLY 2 /* only check auth cap */
#define CHECK_CAPS_FLUSH 4 /* flush any dirty caps */

/*
 * Snapped cap state that is pending flush to mds. When a snapshot occurs,
 * we first complete any in-process sync writes and writeback any dirty
 * data before flushing the snapped state (tracked here) back to the MDS.
 */
struct ceph_cap_snap {
	atomic_t nref;
	struct ceph_inode_info *ci;
	struct list_head ci_item, flushing_item;

	u64 follows, flush_tid;
	int issued, dirty;
	struct ceph_snap_context *context;

	mode_t mode;
	uid_t uid;
	gid_t gid;

	void *xattr_blob;
	int xattr_len;
	u64 xattr_version;

	u64 size;
	struct timespec mtime, atime, ctime;
	u64 time_warp_seq;
	int writing; /* a sync write is still in progress */
	int dirty_pages; /* dirty pages awaiting writeback */
};

static inline void ceph_put_cap_snap(struct ceph_cap_snap *capsnap)
{
	if (atomic_dec_and_test(&capsnap->nref))
		kfree(capsnap);
}
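
Editor's sketch (not part of the commit): ceph_put_cap_snap() is the classic put half of a reference count, where whoever drops the count to zero frees the object. Below is a minimal userspace analogue using C11 atomics instead of the kernel's atomic_t. The struct and function names are hypothetical, and the return value of snap_obj_put() is added only so the behaviour can be observed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct snap_obj {
	atomic_int nref;
	int payload;
};

struct snap_obj *snap_obj_new(void)
{
	struct snap_obj *o = calloc(1, sizeof(*o));

	if (o)
		atomic_init(&o->nref, 1);	/* creator holds the first reference */
	return o;
}

void snap_obj_get(struct snap_obj *o)
{
	atomic_fetch_add(&o->nref, 1);
}

/* Drop one reference; returns 1 if this call freed the object. */
int snap_obj_put(struct snap_obj *o)
{
	/* fetch_sub returns the old value, so 1 means the last ref is gone */
	if (atomic_fetch_sub(&o->nref, 1) == 1) {
		free(o);
		return 1;
	}
	return 0;
}
```

Because the decrement and the zero test are one atomic step, two concurrent puts cannot both see the count reach zero.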

/*
 * The frag tree describes how a directory is fragmented, potentially across
 * multiple metadata servers. It is also used to indicate points where
 * metadata authority is delegated, and whether/where metadata is replicated.
 *
 * A _leaf_ frag will be present in the i_fragtree IFF there is
 * delegation info. That is, if mds >= 0 || ndist > 0.
 */
#define CEPH_MAX_DIRFRAG_REP 4

struct ceph_inode_frag {
	struct rb_node node;

	/* fragtree state */
	u32 frag;
	int split_by; /* i.e. 2^(split_by) children */

	/* delegation and replication info */
	int mds; /* -1 if same authority as parent */
	int ndist; /* >0 if replicated */
	int dist[CEPH_MAX_DIRFRAG_REP];
};

/*
 * We cache inode xattrs as an encoded blob until they are first used,
 * at which point we parse them into an rbtree.
 */
struct ceph_inode_xattr {
	struct rb_node node;

	const char *name;
	int name_len;
	const char *val;
	int val_len;
	int dirty;

	int should_free_name;
	int should_free_val;
};

struct ceph_inode_xattrs_info {
	/*
	 * (still encoded) xattr blob. we avoid the overhead of parsing
	 * this until someone actually calls getxattr, etc.
	 *
	 * blob->vec.iov_len == 4 implies there are no xattrs; blob ==
	 * NULL means we don't know.
	 */
	struct ceph_buffer *blob, *prealloc_blob;

	struct rb_root index;
	bool dirty;
	int count;
	int names_size;
	int vals_size;
	u64 version, index_version;
};

/*
 * Ceph inode.
 */
#define CEPH_I_COMPLETE 1 /* we have complete directory cached */
#define CEPH_I_NODELAY 4 /* do not delay cap release */
#define CEPH_I_FLUSH 8 /* do not delay flush of dirty metadata */
#define CEPH_I_NOFLUSH 16 /* do not flush dirty caps */

struct ceph_inode_info {
	struct ceph_vino i_vino; /* ceph ino + snap */

	u64 i_version;
	u32 i_time_warp_seq;

	unsigned i_ceph_flags;
	unsigned long i_release_count;

	struct ceph_file_layout i_layout;
	char *i_symlink;

	/* for dirs */
	struct timespec i_rctime;
	u64 i_rbytes, i_rfiles, i_rsubdirs;
	u64 i_files, i_subdirs;
	u64 i_max_offset; /* largest readdir offset, set with I_COMPLETE */

	struct rb_root i_fragtree;
	struct mutex i_fragtree_mutex;

	struct ceph_inode_xattrs_info i_xattrs;

	/* capabilities. protected _both_ by i_lock and cap->session's
	 * s_mutex. */
	struct rb_root i_caps; /* cap list */
	struct ceph_cap *i_auth_cap; /* authoritative cap, if any */
	unsigned i_dirty_caps, i_flushing_caps; /* mask of dirtied fields */
	struct list_head i_dirty_item, i_flushing_item;
	u64 i_cap_flush_seq;
	/* we need to track cap writeback on a per-cap-bit basis, to allow
	 * overlapping, pipelined cap flushes to the mds. we can probably
	 * reduce the tid to 8 bits if we're concerned about inode size. */
	u16 i_cap_flush_last_tid, i_cap_flush_tid[CEPH_CAP_BITS];
	wait_queue_head_t i_cap_wq; /* threads waiting on a capability */
	unsigned long i_hold_caps_min; /* jiffies */
	unsigned long i_hold_caps_max; /* jiffies */
	struct list_head i_cap_delay_list; /* for delayed cap release to mds */
	int i_cap_exporting_mds; /* to handle cap migration between */
	unsigned i_cap_exporting_mseq; /* mds's. */
	unsigned i_cap_exporting_issued;
	struct ceph_cap_reservation i_cap_migration_resv;
	struct list_head i_cap_snaps; /* snapped state pending flush to mds */
	struct ceph_snap_context *i_head_snapc; /* set if wr_buffer_head > 0 */
	unsigned i_snap_caps; /* cap bits for snapped files */

	int i_nr_by_mode[CEPH_FILE_MODE_NUM]; /* open file counts */

	u32 i_truncate_seq; /* last truncate to smaller size */
	u64 i_truncate_size; /* and the size we last truncated down to */
	int i_truncate_pending; /* still need to call vmtruncate */

	u64 i_max_size; /* max file size authorized by mds */
	u64 i_reported_size; /* (max_)size reported to or requested of mds */
	u64 i_wanted_max_size; /* offset we'd like to write to */
	u64 i_requested_max_size; /* max_size we've requested */

	/* held references to caps */
	int i_pin_ref;
	int i_rd_ref, i_rdcache_ref, i_wr_ref;
	int i_wrbuffer_ref, i_wrbuffer_ref_head;
	u32 i_shared_gen; /* increment each time we get FILE_SHARED */
	u32 i_rdcache_gen; /* we increment this each time we get
			      FILE_CACHE. If it's non-zero, we
			      _may_ have cached pages. */
	u32 i_rdcache_revoking; /* RDCACHE gen to async invalidate, if any */

	struct list_head i_unsafe_writes; /* uncommitted sync writes */
	struct list_head i_unsafe_dirops; /* uncommitted mds dir ops */
	spinlock_t i_unsafe_lock;

	struct ceph_snap_realm *i_snap_realm; /* snap realm (if caps) */
	int i_snap_realm_counter; /* snap realm (if caps) */
	struct list_head i_snap_realm_item;
	struct list_head i_snap_flush_item;

	struct work_struct i_wb_work; /* writeback work */
	struct work_struct i_pg_inv_work; /* page invalidation work */

	struct work_struct i_vmtruncate_work;

	struct inode vfs_inode; /* at end */
};

static inline struct ceph_inode_info *ceph_inode(struct inode *inode)
{
	return container_of(inode, struct ceph_inode_info, vfs_inode);
}
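
Editor's sketch (not part of the commit): ceph_inode() recovers the enclosing ceph_inode_info from a pointer to its embedded vfs_inode member. The snippet below shows the underlying pointer arithmetic in portable C. The struct names and the container_of_sketch() macro are hypothetical; the kernel's own container_of() adds type checking on top of the same idea.

```c
#include <assert.h>
#include <stddef.h>

struct generic_inode {
	unsigned long ino;
};

struct fs_inode_info {
	int fs_private;
	struct generic_inode vfs_inode;	/* embedded, as in ceph_inode_info */
};

/* Subtract the member's offset from the member's address. */
#define container_of_sketch(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct fs_inode_info *to_fs_inode(struct generic_inode *inode)
{
	return container_of_sketch(inode, struct fs_inode_info, vfs_inode);
}
```

The embedded member does not need to be first or last in the struct for this to work; offsetof() accounts for its position.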

static inline void ceph_i_clear(struct inode *inode, unsigned mask)
{
	struct ceph_inode_info *ci = ceph_inode(inode);

	spin_lock(&inode->i_lock);
	ci->i_ceph_flags &= ~mask;
	spin_unlock(&inode->i_lock);
}

static inline void ceph_i_set(struct inode *inode, unsigned mask)
{
	struct ceph_inode_info *ci = ceph_inode(inode);

	spin_lock(&inode->i_lock);
	ci->i_ceph_flags |= mask;
	spin_unlock(&inode->i_lock);
}

static inline bool ceph_i_test(struct inode *inode, unsigned mask)
{
	struct ceph_inode_info *ci = ceph_inode(inode);
	bool r;

	smp_mb();
	r = (ci->i_ceph_flags & mask) == mask;
	return r;
}
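
Editor's sketch (not part of the commit): ceph_i_test() compares the masked flags against the mask itself, so it is true only when every bit in the mask is set. The snippet below contrasts that with an any-bit test. The function names are hypothetical, and the sketch leaves out the locking and memory barrier used by the kernel helpers.

```c
#include <assert.h>

/* True only if every bit in mask is set in flags. */
int flags_test_all(unsigned flags, unsigned mask)
{
	return (flags & mask) == mask;
}

/* True if at least one bit in mask is set in flags. */
int flags_test_any(unsigned flags, unsigned mask)
{
	return (flags & mask) != 0;
}
```

With flags holding the values of CEPH_I_COMPLETE and CEPH_I_FLUSH (1 and 8), a mask of 1 | 4 passes the any-bit test but fails the all-bits test.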


/* find a specific frag @f */
extern struct ceph_inode_frag *__ceph_find_frag(struct ceph_inode_info *ci,
						u32 f);

/*
 * choose fragment for value @v. copy frag content to pfrag, if leaf
 * exists
 */
extern u32 ceph_choose_frag(struct ceph_inode_info *ci, u32 v,
			    struct ceph_inode_frag *pfrag,
			    int *found);

/*
 * Ceph dentry state
 */
struct ceph_dentry_info {
	struct ceph_mds_session *lease_session;
	u32 lease_gen, lease_shared_gen;
	u32 lease_seq;
	unsigned long lease_renew_after, lease_renew_from;
	struct list_head lru;
	struct dentry *dentry;
	u64 time;
	u64 offset;
};

static inline struct ceph_dentry_info *ceph_dentry(struct dentry *dentry)
{
	return (struct ceph_dentry_info *)dentry->d_fsdata;
}

static inline loff_t ceph_make_fpos(unsigned frag, unsigned off)
{
	return ((loff_t)frag << 32) | (loff_t)off;
}
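
Editor's sketch (not part of the commit): ceph_make_fpos() packs a directory fragment id into the upper 32 bits of a file position and an offset into the lower 32 bits. Below is a standalone version with matching unpack helpers. All three names are hypothetical, and the sketch shifts in unsigned arithmetic before converting to the signed position type, which sidesteps shifting a bit into the sign position.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a fragment id (high 32 bits) and an offset (low 32 bits). */
int64_t make_fpos(uint32_t frag, uint32_t off)
{
	return (int64_t)(((uint64_t)frag << 32) | off);
}

/* Recover the fragment id from a packed position. */
uint32_t fpos_frag(int64_t pos)
{
	return (uint32_t)((uint64_t)pos >> 32);
}

/* Recover the offset within the fragment. */
uint32_t fpos_off(int64_t pos)
{
	return (uint32_t)((uint64_t)pos & 0xffffffffu);
}
```

Positions built this way sort first by fragment and then by offset whenever the fragment id is below 2^31.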

/*
 * ino_t is <64 bits on many architectures, blech.
 *
 * don't include snap in ino hash, at least for now.
 */
static inline ino_t ceph_vino_to_ino(struct ceph_vino vino)
{
	ino_t ino = (ino_t)vino.ino; /* ^ (vino.snap << 20); */
#if BITS_PER_LONG == 32
	ino ^= vino.ino >> (sizeof(u64)-sizeof(ino_t)) * 8;
	if (!ino)
		ino = 1;
#endif
	return ino;
}
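
Editor's sketch (not part of the commit): on 32-bit builds ceph_vino_to_ino() folds the 64-bit Ceph inode number into a narrower ino_t by XORing the high bits into the low bits, and maps a zero result to 1. The standalone version below fixes the narrow type at 32 bits for illustration; fold_ino32() is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Fold a 64-bit inode number into 32 bits by XORing its two halves. */
uint32_t fold_ino32(uint64_t ino64)
{
	uint32_t ino = (uint32_t)ino64 ^ (uint32_t)(ino64 >> 32);

	if (!ino)
		ino = 1;	/* never hand back inode number 0 */
	return ino;
}
```

The fold is lossy: 0x100000000 and 1 both map to 1. That fits with ceph_find_inode() passing a comparison callback that checks the full vino, rather than relying on the folded number alone.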

static inline int ceph_set_ino_cb(struct inode *inode, void *data)
{
	ceph_inode(inode)->i_vino = *(struct ceph_vino *)data;
	inode->i_ino = ceph_vino_to_ino(*(struct ceph_vino *)data);
	return 0;
}

static inline struct ceph_vino ceph_vino(struct inode *inode)
{
	return ceph_inode(inode)->i_vino;
}

/* for printf-style formatting */
#define ceph_vinop(i) ceph_inode(i)->i_vino.ino, ceph_inode(i)->i_vino.snap

static inline u64 ceph_ino(struct inode *inode)
{
	return ceph_inode(inode)->i_vino.ino;
}
static inline u64 ceph_snap(struct inode *inode)
{
	return ceph_inode(inode)->i_vino.snap;
}

static inline int ceph_ino_compare(struct inode *inode, void *data)
{
	struct ceph_vino *pvino = (struct ceph_vino *)data;
	struct ceph_inode_info *ci = ceph_inode(inode);
	return ci->i_vino.ino == pvino->ino &&
		ci->i_vino.snap == pvino->snap;
}

static inline struct inode *ceph_find_inode(struct super_block *sb,
					    struct ceph_vino vino)
{
	ino_t t = ceph_vino_to_ino(vino);
	return ilookup5(sb, t, ceph_ino_compare, &vino);
}
|
||||
|
||||
|
||||
/*
|
||||
* caps helpers
|
||||
*/
|
||||
static inline bool __ceph_is_any_real_caps(struct ceph_inode_info *ci)
|
||||
{
|
||||
return !RB_EMPTY_ROOT(&ci->i_caps);
|
||||
}
|
||||
|
||||
extern int __ceph_caps_issued(struct ceph_inode_info *ci, int *implemented);
|
||||
extern int __ceph_caps_issued_mask(struct ceph_inode_info *ci, int mask, int t);
|
||||
extern int __ceph_caps_issued_other(struct ceph_inode_info *ci,
|
||||
struct ceph_cap *cap);
|
||||
|
||||
static inline int ceph_caps_issued(struct ceph_inode_info *ci)
|
||||
{
|
||||
int issued;
|
||||
spin_lock(&ci->vfs_inode.i_lock);
|
||||
issued = __ceph_caps_issued(ci, NULL);
|
||||
spin_unlock(&ci->vfs_inode.i_lock);
|
||||
return issued;
|
||||
}
|
||||
|
||||
static inline int ceph_caps_issued_mask(struct ceph_inode_info *ci, int mask,
|
||||
int touch)
|
||||
{
|
||||
int r;
|
||||
spin_lock(&ci->vfs_inode.i_lock);
|
||||
r = __ceph_caps_issued_mask(ci, mask, touch);
|
||||
spin_unlock(&ci->vfs_inode.i_lock);
|
||||
return r;
|
||||
}
|
||||
|
||||
static inline int __ceph_caps_dirty(struct ceph_inode_info *ci)
|
||||
{
|
||||
return ci->i_dirty_caps | ci->i_flushing_caps;
|
||||
}
|
||||
extern void __ceph_mark_dirty_caps(struct ceph_inode_info *ci, int mask);
|
||||
|
||||
extern int ceph_caps_revoking(struct ceph_inode_info *ci, int mask);
|
||||
extern int __ceph_caps_used(struct ceph_inode_info *ci);
|
||||
|
||||
extern int __ceph_caps_file_wanted(struct ceph_inode_info *ci);
|
||||
|
||||
/*
|
||||
* wanted, by virtue of open file modes AND cap refs (buffered/cached data)
|
||||
*/
|
||||
static inline int __ceph_caps_wanted(struct ceph_inode_info *ci)
|
||||
{
|
||||
int w = __ceph_caps_file_wanted(ci) | __ceph_caps_used(ci);
|
||||
if (w & CEPH_CAP_FILE_BUFFER)
|
||||
w |= CEPH_CAP_FILE_EXCL; /* we want EXCL if dirty data */
|
||||
return w;
|
||||
}
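The mask arithmetic in __ceph_caps_wanted() is easy to check in isolation. This is a sketch only: the DEMO_CAP_* bit values are made up for illustration, and the real CEPH_CAP_* constants come from ceph_fs.h and may differ.

```c
/* Hypothetical bit values; not the real CEPH_CAP_* constants. */
#define DEMO_CAP_FILE_RD	0x1
#define DEMO_CAP_FILE_EXCL	0x2
#define DEMO_CAP_FILE_BUFFER	0x4

/* Combine caps wanted by open file modes with caps currently in use. */
static int demo_caps_wanted(int file_wanted, int used)
{
	int w = file_wanted | used;

	/* holding buffered (dirty) data implies wanting EXCL as well */
	if (w & DEMO_CAP_FILE_BUFFER)
		w |= DEMO_CAP_FILE_EXCL;
	return w;
}
```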
|
||||
|
||||
/* what the mds thinks we want */
|
||||
extern int __ceph_caps_mds_wanted(struct ceph_inode_info *ci);
|
||||
|
||||
extern void ceph_caps_init(void);
|
||||
extern void ceph_caps_finalize(void);
|
||||
extern void ceph_adjust_min_caps(int delta);
|
||||
extern int ceph_reserve_caps(struct ceph_cap_reservation *ctx, int need);
|
||||
extern int ceph_unreserve_caps(struct ceph_cap_reservation *ctx);
|
||||
extern void ceph_reservation_status(struct ceph_client *client,
|
||||
int *total, int *avail, int *used,
|
||||
int *reserved, int *min);
|
||||
|
||||
static inline struct ceph_client *ceph_inode_to_client(struct inode *inode)
|
||||
{
|
||||
return (struct ceph_client *)inode->i_sb->s_fs_info;
|
||||
}
|
||||
|
||||
static inline struct ceph_client *ceph_sb_to_client(struct super_block *sb)
|
||||
{
|
||||
return (struct ceph_client *)sb->s_fs_info;
|
||||
}
|
||||
|
||||
|
||||
/*
|
||||
* we keep buffered readdir results attached to file->private_data
|
||||
*/
|
||||
struct ceph_file_info {
|
||||
int fmode; /* initialized on open */
|
||||
|
||||
/* readdir: position within the dir */
|
||||
u32 frag;
|
||||
struct ceph_mds_request *last_readdir;
|
||||
int at_end;
|
||||
|
||||
/* readdir: position within a frag */
|
||||
unsigned offset; /* offset of last chunk, adjusted for . and .. */
|
||||
u64 next_offset; /* offset of next chunk (last_name's + 1) */
|
||||
char *last_name; /* last entry in previous chunk */
|
||||
struct dentry *dentry; /* next dentry (for dcache readdir) */
|
||||
unsigned long dir_release_count;
|
||||
|
||||
/* used for read() on a directory when mounted with -o dirstat */
|
||||
char *dir_info;
|
||||
int dir_info_len;
|
||||
};
|
||||
|
||||
|
||||
|
||||
/*
|
||||
* snapshots
|
||||
*/
|
||||
|
||||
/*
|
||||
* A "snap context" is the set of existing snapshots when we
|
||||
* write data. It is used by the OSD to guide its COW behavior.
|
||||
*
|
||||
* The ceph_snap_context is refcounted, and attached to each dirty
|
||||
* page, indicating which context the dirty data belonged to when it was
|
||||
* dirtied.
|
||||
*/
|
||||
struct ceph_snap_context {
|
||||
atomic_t nref;
|
||||
u64 seq;
|
||||
int num_snaps;
|
||||
u64 snaps[];
|
||||
};
|
||||
|
||||
static inline struct ceph_snap_context *
|
||||
ceph_get_snap_context(struct ceph_snap_context *sc)
|
||||
{
|
||||
/*
|
||||
printk("get_snap_context %p %d -> %d\n", sc, atomic_read(&sc->nref),
|
||||
atomic_read(&sc->nref)+1);
|
||||
*/
|
||||
if (sc)
|
||||
atomic_inc(&sc->nref);
|
||||
return sc;
|
||||
}
|
||||
|
||||
static inline void ceph_put_snap_context(struct ceph_snap_context *sc)
|
||||
{
|
||||
if (!sc)
|
||||
return;
|
||||
/*
|
||||
printk("put_snap_context %p %d -> %d\n", sc, atomic_read(&sc->nref),
|
||||
atomic_read(&sc->nref)-1);
|
||||
*/
|
||||
if (atomic_dec_and_test(&sc->nref)) {
|
||||
/*printk(" deleting snap_context %p\n", sc);*/
|
||||
kfree(sc);
|
||||
}
|
||||
}
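The get/put pair above follows the usual reference-counting pattern: NULL is tolerated, and the object is freed when the last reference is dropped. Below is a userspace sketch of that pattern. It swaps the kernel's atomic_t for C11 <stdatomic.h>, and every demo_* name is hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace analogue of struct ceph_snap_context. */
struct demo_snap_context {
	atomic_int nref;
	uint64_t seq;
	int num_snaps;
	uint64_t snaps[];	/* flexible array, sized at allocation */
};

/* Allocate a context for num_snaps ids; the caller owns one reference. */
static struct demo_snap_context *demo_snapc_create(int num_snaps)
{
	struct demo_snap_context *sc;

	sc = calloc(1, sizeof(*sc) + (size_t)num_snaps * sizeof(sc->snaps[0]));
	if (!sc)
		return NULL;
	atomic_init(&sc->nref, 1);
	sc->num_snaps = num_snaps;
	return sc;
}

/* Take a reference; NULL is passed through unchanged. */
static struct demo_snap_context *demo_snapc_get(struct demo_snap_context *sc)
{
	if (sc)
		atomic_fetch_add(&sc->nref, 1);
	return sc;
}

/* Drop a reference. Returns 1 if this call freed the context, else 0. */
static int demo_snapc_put(struct demo_snap_context *sc)
{
	if (!sc)
		return 0;
	/* fetch_sub returns the previous value: 1 means we held the last ref */
	if (atomic_fetch_sub(&sc->nref, 1) == 1) {
		free(sc);
		return 1;
	}
	return 0;
}
```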
|
||||
|
||||
/*
|
||||
* A "snap realm" describes a subset of the file hierarchy sharing
|
||||
* the same set of snapshots that apply to it. The realms themselves
|
||||
* are organized into a hierarchy, such that children inherit (some of)
|
||||
* the snapshots of their parents.
|
||||
*
|
||||
* All inodes within the realm that have capabilities are linked into a
|
||||
* per-realm list.
|
||||
*/
|
||||
struct ceph_snap_realm {
|
||||
u64 ino;
|
||||
atomic_t nref;
|
||||
struct rb_node node;
|
||||
|
||||
u64 created, seq;
|
||||
u64 parent_ino;
|
||||
u64 parent_since; /* snapid when our current parent became so */
|
||||
|
||||
u64 *prior_parent_snaps; /* snaps inherited from any parents we */
|
||||
int num_prior_parent_snaps; /* had prior to parent_since */
|
||||
u64 *snaps; /* snaps specific to this realm */
|
||||
int num_snaps;
|
||||
|
||||
struct ceph_snap_realm *parent;
|
||||
struct list_head children; /* list of child realms */
|
||||
struct list_head child_item;
|
||||
|
||||
struct list_head empty_item; /* if i have ref==0 */
|
||||
|
||||
/* the current set of snaps for this realm */
|
||||
struct ceph_snap_context *cached_context;
|
||||
|
||||
struct list_head inodes_with_caps;
|
||||
spinlock_t inodes_with_caps_lock;
|
||||
};
|
||||
|
||||
|
||||
|
||||
/*
|
||||
* calculate the number of pages a given length and offset map onto,
|
||||
* if we align the data.
|
||||
*/
|
||||
static inline int calc_pages_for(u64 off, u64 len)
|
||||
{
|
||||
return ((off+len+PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT) -
|
||||
(off >> PAGE_CACHE_SHIFT);
|
||||
}
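The arithmetic in calc_pages_for() can be worked through with concrete numbers. This is a userspace sketch assuming 4 KiB pages; the kernel helper uses PAGE_CACHE_SHIFT and PAGE_CACHE_SIZE, and the DEMO_* names here are hypothetical.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages. */
#define DEMO_PAGE_SHIFT	12
#define DEMO_PAGE_SIZE	(1ULL << DEMO_PAGE_SHIFT)

/* Number of pages touched by the byte range [off, off + len). */
static int demo_calc_pages_for(uint64_t off, uint64_t len)
{
	/* index one past the last page, minus the index of the first page */
	return (int)(((off + len + DEMO_PAGE_SIZE - 1) >> DEMO_PAGE_SHIFT) -
		     (off >> DEMO_PAGE_SHIFT));
}
```

A two-byte range starting at offset 4095 straddles a page boundary, so it counts as two pages even though it is far smaller than one.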
|
||||
|
||||
|
||||
|
||||
/* snap.c */
|
||||
struct ceph_snap_realm *ceph_lookup_snap_realm(struct ceph_mds_client *mdsc,
|
||||
u64 ino);
|
||||
extern void ceph_get_snap_realm(struct ceph_mds_client *mdsc,
|
||||
struct ceph_snap_realm *realm);
|
||||
extern void ceph_put_snap_realm(struct ceph_mds_client *mdsc,
|
||||
struct ceph_snap_realm *realm);
|
||||
extern int ceph_update_snap_trace(struct ceph_mds_client *m,
|
||||
void *p, void *e, bool deletion);
|
||||
extern void ceph_handle_snap(struct ceph_mds_client *mdsc,
|
||||
struct ceph_mds_session *session,
|
||||
struct ceph_msg *msg);
|
||||
extern void ceph_queue_cap_snap(struct ceph_inode_info *ci,
|
||||
struct ceph_snap_context *snapc);
|
||||
extern int __ceph_finish_cap_snap(struct ceph_inode_info *ci,
|
||||
struct ceph_cap_snap *capsnap);
|
||||
extern void ceph_cleanup_empty_realms(struct ceph_mds_client *mdsc);
|
||||
|
||||
/*
|
||||
* a cap_snap is "pending" if it is still awaiting an in-progress
|
||||
* sync write (that may/may not still update size, mtime, etc.).
|
||||
*/
|
||||
static inline bool __ceph_have_pending_cap_snap(struct ceph_inode_info *ci)
|
||||
{
|
||||
return !list_empty(&ci->i_cap_snaps) &&
|
||||
list_entry(ci->i_cap_snaps.prev, struct ceph_cap_snap,
|
||||
ci_item)->writing;
|
||||
}
|
||||
|
||||
|
||||
/* super.c */
|
||||
extern struct kmem_cache *ceph_inode_cachep;
|
||||
extern struct kmem_cache *ceph_cap_cachep;
|
||||
extern struct kmem_cache *ceph_dentry_cachep;
|
||||
extern struct kmem_cache *ceph_file_cachep;
|
||||
|
||||
extern const char *ceph_msg_type_name(int type);
|
||||
extern int ceph_check_fsid(struct ceph_client *client, struct ceph_fsid *fsid);
|
||||
|
||||
#define FSID_FORMAT "%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-" \
|
||||
"%02x%02x%02x%02x%02x%02x"
|
||||
#define PR_FSID(f) (f)->fsid[0], (f)->fsid[1], (f)->fsid[2], (f)->fsid[3], \
|
||||
(f)->fsid[4], (f)->fsid[5], (f)->fsid[6], (f)->fsid[7], \
|
||||
(f)->fsid[8], (f)->fsid[9], (f)->fsid[10], (f)->fsid[11], \
|
||||
(f)->fsid[12], (f)->fsid[13], (f)->fsid[14], (f)->fsid[15]
|
||||
|
||||
/* inode.c */
|
||||
extern const struct inode_operations ceph_file_iops;
|
||||
|
||||
extern struct inode *ceph_alloc_inode(struct super_block *sb);
|
||||
extern void ceph_destroy_inode(struct inode *inode);
|
||||
|
||||
extern struct inode *ceph_get_inode(struct super_block *sb,
|
||||
struct ceph_vino vino);
|
||||
extern struct inode *ceph_get_snapdir(struct inode *parent);
|
||||
extern int ceph_fill_file_size(struct inode *inode, int issued,
|
||||
u32 truncate_seq, u64 truncate_size, u64 size);
|
||||
extern void ceph_fill_file_time(struct inode *inode, int issued,
|
||||
u64 time_warp_seq, struct timespec *ctime,
|
||||
struct timespec *mtime, struct timespec *atime);
|
||||
extern int ceph_fill_trace(struct super_block *sb,
|
||||
struct ceph_mds_request *req,
|
||||
struct ceph_mds_session *session);
|
||||
extern int ceph_readdir_prepopulate(struct ceph_mds_request *req,
|
||||
struct ceph_mds_session *session);
|
||||
|
||||
extern int ceph_inode_holds_cap(struct inode *inode, int mask);
|
||||
|
||||
extern int ceph_inode_set_size(struct inode *inode, loff_t size);
|
||||
extern void __ceph_do_pending_vmtruncate(struct inode *inode);
|
||||
extern void ceph_queue_vmtruncate(struct inode *inode);
|
||||
|
||||
extern void ceph_queue_invalidate(struct inode *inode);
|
||||
extern void ceph_queue_writeback(struct inode *inode);
|
||||
|
||||
extern int ceph_do_getattr(struct inode *inode, int mask);
|
||||
extern int ceph_permission(struct inode *inode, int mask);
|
||||
extern int ceph_setattr(struct dentry *dentry, struct iattr *attr);
|
||||
extern int ceph_getattr(struct vfsmount *mnt, struct dentry *dentry,
|
||||
struct kstat *stat);
|
||||
|
||||
/* xattr.c */
|
||||
extern int ceph_setxattr(struct dentry *, const char *, const void *,
|
||||
size_t, int);
|
||||
extern ssize_t ceph_getxattr(struct dentry *, const char *, void *, size_t);
|
||||
extern ssize_t ceph_listxattr(struct dentry *, char *, size_t);
|
||||
extern int ceph_removexattr(struct dentry *, const char *);
|
||||
extern void __ceph_build_xattrs_blob(struct ceph_inode_info *ci);
|
||||
extern void __ceph_destroy_xattrs(struct ceph_inode_info *ci);
|
||||
|
||||
/* caps.c */
|
||||
extern const char *ceph_cap_string(int c);
|
||||
extern void ceph_handle_caps(struct ceph_mds_session *session,
|
||||
struct ceph_msg *msg);
|
||||
extern int ceph_add_cap(struct inode *inode,
|
||||
struct ceph_mds_session *session, u64 cap_id,
|
||||
int fmode, unsigned issued, unsigned wanted,
|
||||
unsigned cap, unsigned seq, u64 realmino, int flags,
|
||||
struct ceph_cap_reservation *caps_reservation);
|
||||
extern void __ceph_remove_cap(struct ceph_cap *cap);
|
||||
static inline void ceph_remove_cap(struct ceph_cap *cap)
|
||||
{
|
||||
struct inode *inode = &cap->ci->vfs_inode;
|
||||
spin_lock(&inode->i_lock);
|
||||
__ceph_remove_cap(cap);
|
||||
spin_unlock(&inode->i_lock);
|
||||
}
|
||||
extern void ceph_put_cap(struct ceph_cap *cap);
|
||||
|
||||
extern void ceph_queue_caps_release(struct inode *inode);
|
||||
extern int ceph_write_inode(struct inode *inode, struct writeback_control *wbc);
|
||||
extern int ceph_fsync(struct file *file, struct dentry *dentry, int datasync);
|
||||
extern void ceph_kick_flushing_caps(struct ceph_mds_client *mdsc,
|
||||
struct ceph_mds_session *session);
|
||||
extern int ceph_get_cap_mds(struct inode *inode);
|
||||
extern void ceph_get_cap_refs(struct ceph_inode_info *ci, int caps);
|
||||
extern void ceph_put_cap_refs(struct ceph_inode_info *ci, int had);
|
||||
extern void ceph_put_wrbuffer_cap_refs(struct ceph_inode_info *ci, int nr,
|
||||
struct ceph_snap_context *snapc);
|
||||
extern void __ceph_flush_snaps(struct ceph_inode_info *ci,
|
||||
struct ceph_mds_session **psession);
|
||||
extern void ceph_check_caps(struct ceph_inode_info *ci, int flags,
|
||||
struct ceph_mds_session *session);
|
||||
extern void ceph_check_delayed_caps(struct ceph_mds_client *mdsc);
|
||||
extern void ceph_flush_dirty_caps(struct ceph_mds_client *mdsc);
|
||||
|
||||
extern int ceph_encode_inode_release(void **p, struct inode *inode,
|
||||
int mds, int drop, int unless, int force);
|
||||
extern int ceph_encode_dentry_release(void **p, struct dentry *dn,
|
||||
int mds, int drop, int unless);
|
||||
|
||||
extern int ceph_get_caps(struct ceph_inode_info *ci, int need, int want,
|
||||
int *got, loff_t endoff);
|
||||
|
||||
/* for counting open files by mode */
|
||||
static inline void __ceph_get_fmode(struct ceph_inode_info *ci, int mode)
|
||||
{
|
||||
ci->i_nr_by_mode[mode]++;
|
||||
}
|
||||
extern void ceph_put_fmode(struct ceph_inode_info *ci, int mode);
|
||||
|
||||
/* addr.c */
|
||||
extern const struct address_space_operations ceph_aops;
|
||||
extern int ceph_mmap(struct file *file, struct vm_area_struct *vma);
|
||||
|
||||
/* file.c */
|
||||
extern const struct file_operations ceph_file_fops;
|
||||
extern const struct address_space_operations ceph_aops;
|
||||
extern int ceph_open(struct inode *inode, struct file *file);
|
||||
extern struct dentry *ceph_lookup_open(struct inode *dir, struct dentry *dentry,
|
||||
struct nameidata *nd, int mode,
|
||||
int locked_dir);
|
||||
extern int ceph_release(struct inode *inode, struct file *filp);
|
||||
extern void ceph_release_page_vector(struct page **pages, int num_pages);
|
||||
|
||||
/* dir.c */
|
||||
extern const struct file_operations ceph_dir_fops;
|
||||
extern const struct inode_operations ceph_dir_iops;
|
||||
extern struct dentry_operations ceph_dentry_ops, ceph_snap_dentry_ops,
|
||||
ceph_snapdir_dentry_ops;
|
||||
|
||||
extern int ceph_handle_notrace_create(struct inode *dir, struct dentry *dentry);
|
||||
extern struct dentry *ceph_finish_lookup(struct ceph_mds_request *req,
|
||||
struct dentry *dentry, int err);
|
||||
|
||||
extern void ceph_dentry_lru_add(struct dentry *dn);
|
||||
extern void ceph_dentry_lru_touch(struct dentry *dn);
|
||||
extern void ceph_dentry_lru_del(struct dentry *dn);
|
||||
|
||||
/*
|
||||
* our d_ops vary depending on whether the inode is live,
|
||||
* snapshotted (read-only), or a virtual ".snap" directory.
|
||||
*/
|
||||
int ceph_init_dentry(struct dentry *dentry);
|
||||
|
||||
|
||||
/* ioctl.c */
|
||||
extern long ceph_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
|
||||
|
||||
/* export.c */
|
||||
extern const struct export_operations ceph_export_ops;
|
||||
|
||||
/* debugfs.c */
|
||||
extern int ceph_debugfs_init(void);
|
||||
extern void ceph_debugfs_cleanup(void);
|
||||
extern int ceph_debugfs_client_init(struct ceph_client *client);
|
||||
extern void ceph_debugfs_client_cleanup(struct ceph_client *client);
|
||||
|
||||
static inline struct inode *get_dentry_parent_inode(struct dentry *dentry)
|
||||
{
|
||||
if (dentry && dentry->d_parent)
|
||||
return dentry->d_parent->d_inode;
|
||||
|
||||
return NULL;
|
||||
}
|
||||
|
||||
#endif /* _FS_CEPH_SUPER_H */
|
29
fs/ceph/types.h
Normal file
@ -0,0 +1,29 @@
|
||||
#ifndef _FS_CEPH_TYPES_H
|
||||
#define _FS_CEPH_TYPES_H
|
||||
|
||||
/* needed before including ceph_fs.h */
|
||||
#include <linux/in.h>
|
||||
#include <linux/types.h>
|
||||
#include <linux/fcntl.h>
|
||||
#include <linux/string.h>
|
||||
|
||||
#include "ceph_fs.h"
|
||||
#include "ceph_frag.h"
|
||||
#include "ceph_hash.h"
|
||||
|
||||
/*
|
||||
* Identify inodes by both their ino AND snapshot id (a u64).
|
||||
*/
|
||||
struct ceph_vino {
|
||||
u64 ino;
|
||||
u64 snap;
|
||||
};
|
||||
|
||||
|
||||
/* context for the caps reservation mechanism */
|
||||
struct ceph_cap_reservation {
|
||||
int count;
|
||||
};
|
||||
|
||||
|
||||
#endif
|
844
fs/ceph/xattr.c
Normal file
@ -0,0 +1,844 @@
|
||||
#include "ceph_debug.h"
|
||||
#include "super.h"
|
||||
#include "decode.h"
|
||||
|
||||
#include <linux/xattr.h>
|
||||
|
||||
static bool ceph_is_valid_xattr(const char *name)
|
||||
{
|
||||
return !strncmp(name, XATTR_SECURITY_PREFIX,
|
||||
XATTR_SECURITY_PREFIX_LEN) ||
|
||||
!strncmp(name, XATTR_TRUSTED_PREFIX, XATTR_TRUSTED_PREFIX_LEN) ||
|
||||
!strncmp(name, XATTR_USER_PREFIX, XATTR_USER_PREFIX_LEN);
|
||||
}
|
||||
|
||||
/*
|
||||
* These define virtual xattrs exposing the recursive directory
|
||||
* statistics and layout metadata.
|
||||
*/
|
||||
struct ceph_vxattr_cb {
|
||||
bool readonly;
|
||||
char *name;
|
||||
size_t (*getxattr_cb)(struct ceph_inode_info *ci, char *val,
|
||||
size_t size);
|
||||
};
|
||||
|
||||
/* directories */
|
||||
|
||||
static size_t ceph_vxattrcb_entries(struct ceph_inode_info *ci, char *val,
|
||||
size_t size)
|
||||
{
|
||||
return snprintf(val, size, "%lld", ci->i_files + ci->i_subdirs);
|
||||
}
|
||||
|
||||
static size_t ceph_vxattrcb_files(struct ceph_inode_info *ci, char *val,
|
||||
size_t size)
|
||||
{
|
||||
return snprintf(val, size, "%lld", ci->i_files);
|
||||
}
|
||||
|
||||
static size_t ceph_vxattrcb_subdirs(struct ceph_inode_info *ci, char *val,
|
||||
size_t size)
|
||||
{
|
||||
return snprintf(val, size, "%lld", ci->i_subdirs);
|
||||
}
|
||||
|
||||
static size_t ceph_vxattrcb_rentries(struct ceph_inode_info *ci, char *val,
|
||||
size_t size)
|
||||
{
|
||||
return snprintf(val, size, "%lld", ci->i_rfiles + ci->i_rsubdirs);
|
||||
}
|
||||
|
||||
static size_t ceph_vxattrcb_rfiles(struct ceph_inode_info *ci, char *val,
|
||||
size_t size)
|
||||
{
|
||||
return snprintf(val, size, "%lld", ci->i_rfiles);
|
||||
}
|
||||
|
||||
static size_t ceph_vxattrcb_rsubdirs(struct ceph_inode_info *ci, char *val,
|
||||
size_t size)
|
||||
{
|
||||
return snprintf(val, size, "%lld", ci->i_rsubdirs);
|
||||
}
|
||||
|
||||
static size_t ceph_vxattrcb_rbytes(struct ceph_inode_info *ci, char *val,
|
||||
size_t size)
|
||||
{
|
||||
return snprintf(val, size, "%lld", ci->i_rbytes);
|
||||
}
|
||||
|
||||
static size_t ceph_vxattrcb_rctime(struct ceph_inode_info *ci, char *val,
|
||||
size_t size)
|
||||
{
|
||||
return snprintf(val, size, "%ld.%09ld", (long)ci->i_rctime.tv_sec,
|
||||
(long)ci->i_rctime.tv_nsec);
|
||||
}
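When a seconds/nanoseconds pair is printed as a single decimal number, the nanosecond field has to be zero-padded to nine digits; without padding, 5 s plus 7 ns would read as 5.7. A small userspace sketch (the name `demo_format_time` is hypothetical):

```c
#include <stdio.h>

/* Format sec/nsec as "<sec>.<nsec>", with nsec padded to nine digits. */
static int demo_format_time(char *buf, size_t size, long sec, long nsec)
{
	return snprintf(buf, size, "%ld.%09ld", sec, nsec);
}
```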
|
||||
|
||||
static struct ceph_vxattr_cb ceph_dir_vxattrs[] = {
|
||||
{ true, "user.ceph.dir.entries", ceph_vxattrcb_entries},
|
||||
{ true, "user.ceph.dir.files", ceph_vxattrcb_files},
|
||||
{ true, "user.ceph.dir.subdirs", ceph_vxattrcb_subdirs},
|
||||
{ true, "user.ceph.dir.rentries", ceph_vxattrcb_rentries},
|
||||
{ true, "user.ceph.dir.rfiles", ceph_vxattrcb_rfiles},
|
||||
{ true, "user.ceph.dir.rsubdirs", ceph_vxattrcb_rsubdirs},
|
||||
{ true, "user.ceph.dir.rbytes", ceph_vxattrcb_rbytes},
|
||||
{ true, "user.ceph.dir.rctime", ceph_vxattrcb_rctime},
|
||||
{ true, NULL, NULL }
|
||||
};
|
||||
|
||||
/* files */
|
||||
|
||||
static size_t ceph_vxattrcb_layout(struct ceph_inode_info *ci, char *val,
				   size_t size)
{
	int ret;

	ret = snprintf(val, size,
		"chunk_bytes=%lld\nstripe_count=%lld\nobject_size=%lld\n",
		(unsigned long long)ceph_file_layout_su(ci->i_layout),
		(unsigned long long)ceph_file_layout_stripe_count(ci->i_layout),
		(unsigned long long)ceph_file_layout_object_size(ci->i_layout));
	/* append using only the space left after the first snprintf */
	if (ceph_file_layout_pg_preferred(ci->i_layout))
		ret += snprintf(val + ret,
			    size > (size_t)ret ? size - ret : 0,
			    "preferred_osd=%lld\n",
			    (unsigned long long)ceph_file_layout_pg_preferred(
				    ci->i_layout));
	return ret;
}
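Building a value from several snprintf() calls requires each call to receive the space remaining in the buffer, not the full buffer size. The following userspace sketch shows one way to do that; `demo_append_kv` is a hypothetical helper, not something this patch provides.

```c
#include <stdio.h>

/*
 * Append "key=value\n" at offset off without writing past buf + size.
 * Returns the total length the output would have, like snprintf().
 */
static int demo_append_kv(char *buf, size_t size, int off,
			  const char *key, long long v)
{
	/* space left after what has already been written (may be 0) */
	size_t left = size > (size_t)off ? size - (size_t)off : 0;

	/* with no space left, only compute the length */
	return off + snprintf(left ? buf + off : NULL, left,
			      "%s=%lld\n", key, v);
}
```

Because the return value keeps counting even after truncation, a caller can compare it against the buffer size to learn how much room the full string needs.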
|
||||
|
||||
static struct ceph_vxattr_cb ceph_file_vxattrs[] = {
|
||||
{ true, "user.ceph.layout", ceph_vxattrcb_layout},
|
||||
{ true, NULL, NULL }
|
||||
};
|
||||
|
||||
static struct ceph_vxattr_cb *ceph_inode_vxattrs(struct inode *inode)
|
||||
{
|
||||
if (S_ISDIR(inode->i_mode))
|
||||
return ceph_dir_vxattrs;
|
||||
else if (S_ISREG(inode->i_mode))
|
||||
return ceph_file_vxattrs;
|
||||
return NULL;
|
||||
}
|
||||
|
||||
static struct ceph_vxattr_cb *ceph_match_vxattr(struct ceph_vxattr_cb *vxattr,
|
||||
const char *name)
|
||||
{
|
||||
do {
|
||||
if (strcmp(vxattr->name, name) == 0)
|
||||
return vxattr;
|
||||
vxattr++;
|
||||
} while (vxattr->name);
|
||||
return NULL;
|
||||
}
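The virtual xattr tables are arrays terminated by an entry whose name is NULL, and lookup is a linear scan up to that sentinel. Here is a userspace sketch of the same table shape. All demo_* names and the returned value are invented for illustration, and the loop checks the sentinel before comparing, which also copes with an empty table.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical analogue of struct ceph_vxattr_cb. */
struct demo_vxattr {
	const char *name;
	size_t (*get)(char *val, size_t size);
};

static size_t demo_get_files(char *val, size_t size)
{
	return (size_t)snprintf(val, size, "%d", 3);	/* made-up value */
}

static const struct demo_vxattr demo_table[] = {
	{ "user.demo.files", demo_get_files },
	{ NULL, NULL }	/* sentinel: name == NULL ends the table */
};

/* Return the entry whose name equals name, or NULL if there is none. */
static const struct demo_vxattr *demo_match(const struct demo_vxattr *t,
					    const char *name)
{
	for (; t->name; t++)
		if (strcmp(t->name, name) == 0)
			return t;
	return NULL;
}
```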
|
||||
|
||||
static int __set_xattr(struct ceph_inode_info *ci,
|
||||
const char *name, int name_len,
|
||||
const char *val, int val_len,
|
||||
int dirty,
|
||||
int should_free_name, int should_free_val,
|
||||
struct ceph_inode_xattr **newxattr)
|
||||
{
|
||||
struct rb_node **p;
|
||||
struct rb_node *parent = NULL;
|
||||
struct ceph_inode_xattr *xattr = NULL;
|
||||
int c;
|
||||
int new = 0;
|
||||
|
||||
p = &ci->i_xattrs.index.rb_node;
|
||||
while (*p) {
|
||||
parent = *p;
|
||||
xattr = rb_entry(parent, struct ceph_inode_xattr, node);
|
||||
c = strncmp(name, xattr->name, min(name_len, xattr->name_len));
|
||||
if (c < 0)
|
||||
p = &(*p)->rb_left;
|
||||
else if (c > 0)
|
||||
p = &(*p)->rb_right;
|
||||
else {
|
||||
if (name_len == xattr->name_len)
|
||||
break;
|
||||
else if (name_len < xattr->name_len)
|
||||
p = &(*p)->rb_left;
|
||||
else
|
||||
p = &(*p)->rb_right;
|
||||
}
|
||||
xattr = NULL;
|
||||
}
|
||||
|
||||
if (!xattr) {
|
||||
new = 1;
|
||||
xattr = *newxattr;
|
||||
xattr->name = name;
|
||||
xattr->name_len = name_len;
|
||||
xattr->should_free_name = should_free_name;
|
||||
|
||||
ci->i_xattrs.count++;
|
||||
dout("__set_xattr count=%d\n", ci->i_xattrs.count);
|
||||
} else {
|
||||
kfree(*newxattr);
|
||||
*newxattr = NULL;
|
||||
if (xattr->should_free_val)
|
||||
kfree((void *)xattr->val);
|
||||
|
||||
if (should_free_name) {
|
||||
kfree((void *)name);
|
||||
name = xattr->name;
|
||||
}
|
||||
ci->i_xattrs.names_size -= xattr->name_len;
|
||||
ci->i_xattrs.vals_size -= xattr->val_len;
|
||||
}
|
||||
if (!xattr) {
|
||||
pr_err("__set_xattr ENOMEM on %p %llx.%llx xattr %.*s=%.*s\n",
&ci->vfs_inode, ceph_vinop(&ci->vfs_inode), name_len, name,
val_len, val);
|
||||
return -ENOMEM;
|
||||
}
|
||||
ci->i_xattrs.names_size += name_len;
|
||||
ci->i_xattrs.vals_size += val_len;
|
||||
if (val)
|
||||
xattr->val = val;
|
||||
else
|
||||
xattr->val = "";
|
||||
|
||||
xattr->val_len = val_len;
|
||||
xattr->dirty = dirty;
|
||||
xattr->should_free_val = (val && should_free_val);
|
||||
|
||||
if (new) {
|
||||
rb_link_node(&xattr->node, parent, p);
|
||||
rb_insert_color(&xattr->node, &ci->i_xattrs.index);
|
||||
dout("__set_xattr_val p=%p\n", p);
|
||||
}
|
||||
|
||||
dout("__set_xattr_val added %llx.%llx xattr %p %s=%.*s\n",
|
||||
ceph_vinop(&ci->vfs_inode), xattr, name, val_len, val);
|
||||
|
||||
return 0;
|
||||
}
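The tree walk in __set_xattr() orders names that carry an explicit length and are not necessarily NUL-terminated: it compares the common prefix first and breaks ties on length, so a shorter name sorts before a longer one with the same prefix. A userspace sketch of that ordering follows. It uses memcmp() where the kernel code uses strncmp(), and `demo_name_cmp` is a hypothetical name.

```c
#include <string.h>

/*
 * Order two length-counted names: by common prefix, then shorter first.
 * Returns <0, 0 or >0, like strcmp().
 */
static int demo_name_cmp(const char *a, int alen, const char *b, int blen)
{
	int n = alen < blen ? alen : blen;
	int c = memcmp(a, b, (size_t)n);

	if (c)
		return c;
	if (alen == blen)
		return 0;
	return alen < blen ? -1 : 1;
}
```

A lookup that stops at the prefix comparison would treat "user.foo" and "user.fo" as equal, so the length tie-break is needed on both the insert and the search path.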
|
||||
|
||||
static struct ceph_inode_xattr *__get_xattr(struct ceph_inode_info *ci,
			   const char *name)
{
	struct rb_node **p;
	struct rb_node *parent = NULL;
	struct ceph_inode_xattr *xattr = NULL;
	int name_len = strlen(name);
	int c;

	p = &ci->i_xattrs.index.rb_node;
	while (*p) {
		parent = *p;
		xattr = rb_entry(parent, struct ceph_inode_xattr, node);
		/*
		 * Stored names are length-counted: compare the common
		 * prefix, then break ties on length as __set_xattr does,
		 * so that "user.foo" does not match a stored "user.fo".
		 */
		c = strncmp(name, xattr->name, min(name_len, xattr->name_len));
		if (c == 0 && name_len != xattr->name_len)
			c = name_len < xattr->name_len ? -1 : 1;
		if (c < 0)
			p = &(*p)->rb_left;
		else if (c > 0)
			p = &(*p)->rb_right;
		else {
			dout("__get_xattr %s: found %.*s\n", name,
			     xattr->val_len, xattr->val);
			return xattr;
		}
	}

	dout("__get_xattr %s: not found\n", name);

	return NULL;
}
|
||||
|
||||
static void __free_xattr(struct ceph_inode_xattr *xattr)
|
||||
{
|
||||
BUG_ON(!xattr);
|
||||
|
||||
if (xattr->should_free_name)
|
||||
kfree((void *)xattr->name);
|
||||
if (xattr->should_free_val)
|
||||
kfree((void *)xattr->val);
|
||||
|
||||
kfree(xattr);
|
||||
}
|
||||
|
||||
static int __remove_xattr(struct ceph_inode_info *ci,
|
||||
struct ceph_inode_xattr *xattr)
|
||||
{
|
||||
if (!xattr)
|
||||
return -EOPNOTSUPP;
|
||||
|
||||
rb_erase(&xattr->node, &ci->i_xattrs.index);
|
||||
|
||||
if (xattr->should_free_name)
|
||||
kfree((void *)xattr->name);
|
||||
if (xattr->should_free_val)
|
||||
kfree((void *)xattr->val);
|
||||
|
||||
ci->i_xattrs.names_size -= xattr->name_len;
|
||||
ci->i_xattrs.vals_size -= xattr->val_len;
|
||||
ci->i_xattrs.count--;
|
||||
kfree(xattr);
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
static int __remove_xattr_by_name(struct ceph_inode_info *ci,
			   const char *name)
{
	struct ceph_inode_xattr *xattr;

	xattr = __get_xattr(ci, name);
	return __remove_xattr(ci, xattr);
}
|
||||
|
||||
static char *__copy_xattr_names(struct ceph_inode_info *ci,
|
||||
char *dest)
|
||||
{
|
||||
struct rb_node *p;
|
||||
struct ceph_inode_xattr *xattr = NULL;
|
||||
|
||||
p = rb_first(&ci->i_xattrs.index);
|
||||
dout("__copy_xattr_names count=%d\n", ci->i_xattrs.count);
|
||||
|
||||
while (p) {
|
||||
xattr = rb_entry(p, struct ceph_inode_xattr, node);
|
||||
memcpy(dest, xattr->name, xattr->name_len);
|
||||
dest[xattr->name_len] = '\0';
|
||||
|
||||
dout("dest=%s %p (%s) (%d/%d)\n", dest, xattr, xattr->name,
|
||||
xattr->name_len, ci->i_xattrs.names_size);
|
||||
|
||||
dest += xattr->name_len + 1;
|
||||
p = rb_next(p);
|
||||
}
|
||||
|
||||
return dest;
|
||||
}
|
||||
|
||||
void __ceph_destroy_xattrs(struct ceph_inode_info *ci)
|
||||
{
|
||||
struct rb_node *p, *tmp;
|
||||
struct ceph_inode_xattr *xattr = NULL;
|
||||
|
||||
p = rb_first(&ci->i_xattrs.index);
|
||||
|
||||
dout("__ceph_destroy_xattrs p=%p\n", p);
|
||||
|
||||
while (p) {
|
||||
xattr = rb_entry(p, struct ceph_inode_xattr, node);
|
||||
tmp = p;
|
||||
p = rb_next(tmp);
|
||||
dout("__ceph_destroy_xattrs next p=%p (%.*s)\n", p,
|
||||
xattr->name_len, xattr->name);
|
||||
rb_erase(tmp, &ci->i_xattrs.index);
|
||||
|
||||
__free_xattr(xattr);
|
||||
}
|
||||
|
||||
ci->i_xattrs.names_size = 0;
|
||||
ci->i_xattrs.vals_size = 0;
|
||||
ci->i_xattrs.index_version = 0;
|
||||
ci->i_xattrs.count = 0;
|
||||
ci->i_xattrs.index = RB_ROOT;
|
||||
}
|
||||
|
||||
static int __build_xattrs(struct inode *inode)
|
||||
{
|
||||
u32 namelen;
|
||||
u32 numattr = 0;
|
||||
void *p, *end;
|
||||
u32 len;
|
||||
const char *name, *val;
|
||||
struct ceph_inode_info *ci = ceph_inode(inode);
|
||||
int xattr_version;
|
||||
struct ceph_inode_xattr **xattrs = NULL;
|
||||
int err = 0;
|
||||
int i;
|
||||
|
||||
dout("__build_xattrs() len=%d\n",
|
||||
ci->i_xattrs.blob ? (int)ci->i_xattrs.blob->vec.iov_len : 0);
|
||||
|
||||
if (ci->i_xattrs.index_version >= ci->i_xattrs.version)
|
||||
return 0; /* already built */
|
||||
|
||||
__ceph_destroy_xattrs(ci);
|
||||
|
||||
start:
|
||||
/* update the internal xattr rb tree */
|
||||
if (ci->i_xattrs.blob && ci->i_xattrs.blob->vec.iov_len > 4) {
|
||||
p = ci->i_xattrs.blob->vec.iov_base;
|
||||
end = p + ci->i_xattrs.blob->vec.iov_len;
|
||||
ceph_decode_32_safe(&p, end, numattr, bad);
|
||||
xattr_version = ci->i_xattrs.version;
|
||||
spin_unlock(&inode->i_lock);
|
||||
|
||||
xattrs = kcalloc(numattr, sizeof(struct ceph_inode_xattr *),
|
||||
GFP_NOFS);
|
||||
err = -ENOMEM;
|
||||
if (!xattrs)
|
||||
goto bad_lock;
|
||||
memset(xattrs, 0, numattr*sizeof(struct ceph_inode_xattr *));
|
||||
for (i = 0; i < numattr; i++) {
|
||||
xattrs[i] = kmalloc(sizeof(struct ceph_inode_xattr),
|
||||
GFP_NOFS);
|
||||
if (!xattrs[i])
|
||||
goto bad_lock;
|
||||
}
|
||||
|
||||
spin_lock(&inode->i_lock);
|
||||
if (ci->i_xattrs.version != xattr_version) {
|
||||
/* lost a race, retry */
|
||||
for (i = 0; i < numattr; i++)
|
||||
kfree(xattrs[i]);
|
||||
kfree(xattrs);
|
||||
goto start;
|
||||
}
|
||||
err = -EIO;
|
||||
while (numattr--) {
|
||||
ceph_decode_32_safe(&p, end, len, bad);
|
||||
namelen = len;
|
||||
name = p;
|
||||
p += len;
|
||||
ceph_decode_32_safe(&p, end, len, bad);
|
||||
val = p;
|
||||
p += len;
|
||||
|
||||
err = __set_xattr(ci, name, namelen, val, len,
|
||||
0, 0, 0, &xattrs[numattr]);
|
||||
|
||||
if (err < 0)
|
||||
goto bad;
|
||||
}
|
||||
kfree(xattrs);
|
||||
}
|
||||
ci->i_xattrs.index_version = ci->i_xattrs.version;
|
||||
ci->i_xattrs.dirty = false;
|
||||
|
||||
return err;
|
||||
bad_lock:
|
||||
spin_lock(&inode->i_lock);
|
||||
bad:
|
||||
if (xattrs) {
|
||||
for (i = 0; i < numattr; i++)
|
||||
kfree(xattrs[i]);
|
||||
kfree(xattrs);
|
||||
}
|
||||
ci->i_xattrs.names_size = 0;
|
||||
return err;
|
||||
}
|
||||
|
||||
static int __get_required_blob_size(struct ceph_inode_info *ci, int name_size,
|
||||
int val_size)
|
||||
{
|
||||
/*
|
||||
* 4 bytes for the xattr count, plus 4 bytes for each xattr name length,
|
||||
* and 4 bytes for each value length
|
||||
*/
|
||||
int size = 4 + ci->i_xattrs.count*(4 + 4) +
|
||||
ci->i_xattrs.names_size +
|
||||
ci->i_xattrs.vals_size;
|
||||
dout("__get_required_blob_size c=%d names.size=%d vals.size=%d\n",
|
||||
ci->i_xattrs.count, ci->i_xattrs.names_size,
|
||||
ci->i_xattrs.vals_size);
|
||||
|
||||
if (name_size)
|
||||
size += 4 + 4 + name_size + val_size;
|
||||
|
||||
return size;
|
||||
}
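The size formula can be checked by hand. Take two xattrs, "user.a"="1" and "user.bb"="22": the names total 13 bytes and the values 3 bytes, so the blob needs 4 + 2*(4 + 4) + 13 + 3 = 36 bytes. The same arithmetic as a userspace sketch (`demo_blob_size` is a hypothetical name):

```c
#include <stddef.h>

/*
 * Bytes needed for an encoded xattr blob: a 4-byte count, then for each
 * xattr a 4-byte name length and a 4-byte value length, plus the raw
 * name and value bytes.
 */
static size_t demo_blob_size(size_t count, size_t names_size,
			     size_t vals_size)
{
	return 4 + count * (4 + 4) + names_size + vals_size;
}
```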
|
||||
|
||||
/*
|
||||
* If there are dirty xattrs, reencode xattrs into the prealloc_blob
|
||||
* and swap into place.
|
||||
*/
|
||||
void __ceph_build_xattrs_blob(struct ceph_inode_info *ci)
|
||||
{
|
||||
struct rb_node *p;
|
||||
struct ceph_inode_xattr *xattr = NULL;
|
||||
void *dest;
|
||||
|
||||
dout("__build_xattrs_blob %p\n", &ci->vfs_inode);
|
||||
if (ci->i_xattrs.dirty) {
|
||||
int need = __get_required_blob_size(ci, 0, 0);
|
||||
|
||||
BUG_ON(need > ci->i_xattrs.prealloc_blob->alloc_len);
|
||||
|
||||
p = rb_first(&ci->i_xattrs.index);
|
||||
dest = ci->i_xattrs.prealloc_blob->vec.iov_base;
|
||||
|
||||
ceph_encode_32(&dest, ci->i_xattrs.count);
|
||||
while (p) {
|
||||
xattr = rb_entry(p, struct ceph_inode_xattr, node);
|
||||
|
||||
ceph_encode_32(&dest, xattr->name_len);
|
||||
memcpy(dest, xattr->name, xattr->name_len);
|
||||
dest += xattr->name_len;
|
||||
ceph_encode_32(&dest, xattr->val_len);
|
||||
memcpy(dest, xattr->val, xattr->val_len);
|
||||
dest += xattr->val_len;
|
||||
|
||||
p = rb_next(p);
|
||||
}
|
||||
|
||||
/* adjust buffer len; it may be larger than we need */
|
||||
ci->i_xattrs.prealloc_blob->vec.iov_len =
|
||||
dest - ci->i_xattrs.prealloc_blob->vec.iov_base;
|
||||
|
||||
if (ci->i_xattrs.blob)
|
||||
ceph_buffer_put(ci->i_xattrs.blob);
|
||||
ci->i_xattrs.blob = ci->i_xattrs.prealloc_blob;
|
||||
ci->i_xattrs.prealloc_blob = NULL;
|
||||
ci->i_xattrs.dirty = false;
|
||||
}
|
||||
}
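The blob layout written by __ceph_build_xattrs_blob() is a count followed by (name length, name, value length, value) records. The userspace sketch below encodes the same layout. It assumes the 32-bit fields are little-endian, takes NUL-terminated strings for convenience, and uses hypothetical demo_* names throughout.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical input record; names and values are C strings here. */
struct demo_xattr {
	const char *name;
	const char *val;
};

/* Store v as 4 little-endian bytes and return the advanced pointer. */
static unsigned char *demo_put_le32(unsigned char *p, uint32_t v)
{
	p[0] = v & 0xff;
	p[1] = (v >> 8) & 0xff;
	p[2] = (v >> 16) & 0xff;
	p[3] = (v >> 24) & 0xff;
	return p + 4;
}

/*
 * Encode n xattrs into dest, which the caller must size beforehand.
 * Returns the number of bytes written.
 */
static size_t demo_encode_xattrs(unsigned char *dest,
				 const struct demo_xattr *x, uint32_t n)
{
	unsigned char *p = demo_put_le32(dest, n);
	uint32_t i;

	for (i = 0; i < n; i++) {
		uint32_t nlen = (uint32_t)strlen(x[i].name);
		uint32_t vlen = (uint32_t)strlen(x[i].val);

		p = demo_put_le32(p, nlen);
		memcpy(p, x[i].name, nlen);
		p += nlen;
		p = demo_put_le32(p, vlen);
		memcpy(p, x[i].val, vlen);
		p += vlen;
	}
	return (size_t)(p - dest);
}
```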
|
||||
|
||||
ssize_t ceph_getxattr(struct dentry *dentry, const char *name, void *value,
|
||||
size_t size)
|
||||
{
|
||||
struct inode *inode = dentry->d_inode;
|
||||
struct ceph_inode_info *ci = ceph_inode(inode);
|
||||
struct ceph_vxattr_cb *vxattrs = ceph_inode_vxattrs(inode);
|
||||
int err;
|
||||
struct ceph_inode_xattr *xattr;
|
||||
struct ceph_vxattr_cb *vxattr = NULL;
|
||||
|
||||
if (!ceph_is_valid_xattr(name))
|
||||
return -ENODATA;
|
||||
|
||||
/* let's see if a virtual xattr was requested */
|
||||
if (vxattrs)
|
||||
vxattr = ceph_match_vxattr(vxattrs, name);
|
||||
|
||||
spin_lock(&inode->i_lock);
|
||||
dout("getxattr %p ver=%lld index_ver=%lld\n", inode,
|
||||
ci->i_xattrs.version, ci->i_xattrs.index_version);
|
||||
|
||||
if (__ceph_caps_issued_mask(ci, CEPH_CAP_XATTR_SHARED, 1) &&
|
||||
(ci->i_xattrs.index_version >= ci->i_xattrs.version)) {
|
||||
goto get_xattr;
|
||||
} else {
|
||||
spin_unlock(&inode->i_lock);
|
||||
/* get xattrs from mds (if we don't already have them) */
|
||||
err = ceph_do_getattr(inode, CEPH_STAT_CAP_XATTR);
|
||||
if (err)
|
||||
return err;
|
||||
}
|
||||
|
||||
spin_lock(&inode->i_lock);
|
||||
|
||||
if (vxattr && vxattr->readonly) {
|
||||
err = vxattr->getxattr_cb(ci, value, size);
|
||||
goto out;
|
||||
}
|
||||
|
||||
err = __build_xattrs(inode);
|
||||
if (err < 0)
|
||||
goto out;
|
||||
|
||||
get_xattr:
|
||||
err = -ENODATA; /* == ENOATTR */
|
||||
xattr = __get_xattr(ci, name);
|
||||
if (!xattr) {
|
||||
if (vxattr)
|
||||
err = vxattr->getxattr_cb(ci, value, size);
|
||||
goto out;
|
||||
}
|
||||
|
||||
err = -ERANGE;
|
||||
if (size && size < xattr->val_len)
|
||||
goto out;
|
||||
|
||||
err = xattr->val_len;
|
||||
if (size == 0)
|
||||
goto out;
|
||||
|
||||
memcpy(value, xattr->val, xattr->val_len);
|
||||
|
||||
out:
|
||||
spin_unlock(&inode->i_lock);
|
||||
return err;
|
||||
}
|
||||
|
||||
ssize_t ceph_listxattr(struct dentry *dentry, char *names, size_t size)
|
||||
{
|
||||
struct inode *inode = dentry->d_inode;
|
||||
struct ceph_inode_info *ci = ceph_inode(inode);
|
||||
struct ceph_vxattr_cb *vxattrs = ceph_inode_vxattrs(inode);
|
||||
u32 vir_namelen = 0;
|
||||
u32 namelen;
|
||||
int err;
|
||||
u32 len;
|
||||
int i;
|
||||
|
||||
spin_lock(&inode->i_lock);
|
||||
dout("listxattr %p ver=%lld index_ver=%lld\n", inode,
|
||||
ci->i_xattrs.version, ci->i_xattrs.index_version);
|
||||
|
||||
if (__ceph_caps_issued_mask(ci, CEPH_CAP_XATTR_SHARED, 1) &&
|
||||
(ci->i_xattrs.index_version >= ci->i_xattrs.version)) {
|
||||
goto list_xattr;
|
||||
} else {
|
||||
spin_unlock(&inode->i_lock);
|
||||
err = ceph_do_getattr(inode, CEPH_STAT_CAP_XATTR);
|
||||
if (err)
|
||||
return err;
|
||||
}
|
||||
|
||||
spin_lock(&inode->i_lock);
|
||||
|
||||
err = __build_xattrs(inode);
|
||||
if (err < 0)
|
||||
goto out;
|
||||
|
||||
list_xattr:
|
||||
vir_namelen = 0;
|
||||
/* include virtual dir xattrs */
|
||||
if (vxattrs)
|
||||
for (i = 0; vxattrs[i].name; i++)
|
||||
vir_namelen += strlen(vxattrs[i].name) + 1;
|
||||
/* add 1 byte per xattr name for its NUL terminator */
|
||||
namelen = vir_namelen + ci->i_xattrs.names_size + ci->i_xattrs.count;
|
||||
err = -ERANGE;
|
||||
if (size && namelen > size)
|
||||
goto out;
|
||||
|
||||
err = namelen;
|
||||
if (size == 0)
|
||||
goto out;
|
||||
|
||||
names = __copy_xattr_names(ci, names);
|
||||
|
||||
/* virtual xattr names, too */
|
||||
if (vxattrs)
|
||||
for (i = 0; vxattrs[i].name; i++) {
|
||||
len = sprintf(names, "%s", vxattrs[i].name);
|
||||
names += len + 1;
|
||||
}
|
||||
|
||||
out:
|
||||
spin_unlock(&inode->i_lock);
|
||||
return err;
|
||||
}
|
||||
|
||||
static int ceph_sync_setxattr(struct dentry *dentry, const char *name,
|
||||
const char *value, size_t size, int flags)
|
||||
{
|
||||
struct ceph_client *client = ceph_client(dentry->d_sb);
|
||||
struct inode *inode = dentry->d_inode;
|
||||
struct ceph_inode_info *ci = ceph_inode(inode);
|
||||
struct inode *parent_inode = dentry->d_parent->d_inode;
|
||||
struct ceph_mds_request *req;
|
||||
struct ceph_mds_client *mdsc = &client->mdsc;
|
||||
int err;
|
||||
int i, nr_pages;
|
||||
struct page **pages = NULL;
|
||||
void *kaddr;
|
||||
|
||||
/* copy value into some pages */
|
||||
nr_pages = calc_pages_for(0, size);
|
||||
if (nr_pages) {
|
||||
pages = kmalloc(sizeof(pages[0])*nr_pages, GFP_NOFS);
|
||||
if (!pages)
|
||||
return -ENOMEM;
|
||||
err = -ENOMEM;
|
||||
for (i = 0; i < nr_pages; i++) {
|
||||
pages[i] = alloc_page(GFP_NOFS);
|
||||
if (!pages[i]) {
|
||||
nr_pages = i;
|
||||
goto out;
|
||||
}
|
||||
kaddr = kmap(pages[i]);
|
||||
memcpy(kaddr, value + i*PAGE_CACHE_SIZE,
|
||||
min(PAGE_CACHE_SIZE, size-i*PAGE_CACHE_SIZE));
|
||||
kunmap(pages[i]); /* balance the kmap() above */
|
||||
}
|
||||
}
|
||||
|
||||
dout("setxattr value=%.*s\n", (int)size, value);
|
||||
|
||||
/* do request */
|
||||
req = ceph_mdsc_create_request(mdsc, CEPH_MDS_OP_SETXATTR,
|
||||
USE_AUTH_MDS);
|
||||
if (IS_ERR(req)) {
|
||||
err = PTR_ERR(req);
|
||||
goto out;
|
||||
}
|
||||
req->r_inode = igrab(inode);
|
||||
req->r_inode_drop = CEPH_CAP_XATTR_SHARED;
|
||||
req->r_num_caps = 1;
|
||||
req->r_args.setxattr.flags = cpu_to_le32(flags);
|
||||
req->r_path2 = kstrdup(name, GFP_NOFS);
|
||||
|
||||
req->r_pages = pages;
|
||||
req->r_num_pages = nr_pages;
|
||||
req->r_data_len = size;
|
||||
|
||||
dout("xattr.ver (before): %lld\n", ci->i_xattrs.version);
|
||||
err = ceph_mdsc_do_request(mdsc, parent_inode, req);
|
||||
ceph_mdsc_put_request(req);
|
||||
dout("xattr.ver (after): %lld\n", ci->i_xattrs.version);
|
||||
|
||||
out:
|
||||
if (pages) {
|
||||
for (i = 0; i < nr_pages; i++)
|
||||
__free_page(pages[i]);
|
||||
kfree(pages);
|
||||
}
|
||||
return err;
|
||||
}
|
||||
|
||||
int ceph_setxattr(struct dentry *dentry, const char *name,
|
||||
const void *value, size_t size, int flags)
|
||||
{
|
||||
struct inode *inode = dentry->d_inode;
|
||||
struct ceph_inode_info *ci = ceph_inode(inode);
|
||||
struct ceph_vxattr_cb *vxattrs = ceph_inode_vxattrs(inode);
|
||||
int err;
|
||||
int name_len = strlen(name);
|
||||
int val_len = size;
|
||||
char *newname = NULL;
|
||||
char *newval = NULL;
|
||||
struct ceph_inode_xattr *xattr = NULL;
|
||||
int issued;
|
||||
int required_blob_size;
|
||||
|
||||
if (ceph_snap(inode) != CEPH_NOSNAP)
|
||||
return -EROFS;
|
||||
|
||||
if (!ceph_is_valid_xattr(name))
|
||||
return -EOPNOTSUPP;
|
||||
|
||||
if (vxattrs) {
|
||||
struct ceph_vxattr_cb *vxattr =
|
||||
ceph_match_vxattr(vxattrs, name);
|
||||
if (vxattr && vxattr->readonly)
|
||||
return -EOPNOTSUPP;
|
||||
}
|
||||
|
||||
/* preallocate memory for xattr name, value, index node */
|
||||
err = -ENOMEM;
|
||||
newname = kmalloc(name_len + 1, GFP_NOFS);
|
||||
if (!newname)
|
||||
goto out;
|
||||
memcpy(newname, name, name_len + 1);
|
||||
|
||||
if (val_len) {
|
||||
newval = kmalloc(val_len + 1, GFP_NOFS);
|
||||
if (!newval)
|
||||
goto out;
|
||||
memcpy(newval, value, val_len);
|
||||
newval[val_len] = '\0';
|
||||
}
|
||||
|
||||
xattr = kmalloc(sizeof(struct ceph_inode_xattr), GFP_NOFS);
|
||||
if (!xattr)
|
||||
goto out;
|
||||
|
||||
spin_lock(&inode->i_lock);
|
||||
retry:
|
||||
issued = __ceph_caps_issued(ci, NULL);
|
||||
if (!(issued & CEPH_CAP_XATTR_EXCL))
|
||||
goto do_sync;
|
||||
__build_xattrs(inode);
|
||||
|
||||
required_blob_size = __get_required_blob_size(ci, name_len, val_len);
|
||||
|
||||
if (!ci->i_xattrs.prealloc_blob ||
|
||||
required_blob_size > ci->i_xattrs.prealloc_blob->alloc_len) {
|
||||
struct ceph_buffer *blob = NULL;
|
||||
|
||||
spin_unlock(&inode->i_lock);
|
||||
dout(" preallocating new blob size=%d\n", required_blob_size);
|
||||
blob = ceph_buffer_new(required_blob_size, GFP_NOFS);
|
||||
if (!blob)
|
||||
goto out;
|
||||
spin_lock(&inode->i_lock);
|
||||
if (ci->i_xattrs.prealloc_blob)
|
||||
ceph_buffer_put(ci->i_xattrs.prealloc_blob);
|
||||
ci->i_xattrs.prealloc_blob = blob;
|
||||
goto retry;
|
||||
}
|
||||
|
||||
dout("setxattr %p issued %s\n", inode, ceph_cap_string(issued));
|
||||
err = __set_xattr(ci, newname, name_len, newval,
|
||||
val_len, 1, 1, 1, &xattr);
|
||||
__ceph_mark_dirty_caps(ci, CEPH_CAP_XATTR_EXCL);
|
||||
ci->i_xattrs.dirty = true;
|
||||
inode->i_ctime = CURRENT_TIME;
|
||||
spin_unlock(&inode->i_lock);
|
||||
|
||||
return err;
|
||||
|
||||
do_sync:
|
||||
spin_unlock(&inode->i_lock);
|
||||
err = ceph_sync_setxattr(dentry, name, value, size, flags);
|
||||
out:
|
||||
kfree(newname);
|
||||
kfree(newval);
|
||||
kfree(xattr);
|
||||
return err;
|
||||
}
|
||||
|
||||
static int ceph_send_removexattr(struct dentry *dentry, const char *name)
|
||||
{
|
||||
struct ceph_client *client = ceph_client(dentry->d_sb);
|
||||
struct ceph_mds_client *mdsc = &client->mdsc;
|
||||
struct inode *inode = dentry->d_inode;
|
||||
struct inode *parent_inode = dentry->d_parent->d_inode;
|
||||
struct ceph_mds_request *req;
|
||||
int err;
|
||||
|
||||
req = ceph_mdsc_create_request(mdsc, CEPH_MDS_OP_RMXATTR,
|
||||
USE_AUTH_MDS);
|
||||
if (IS_ERR(req))
|
||||
return PTR_ERR(req);
|
||||
req->r_inode = igrab(inode);
|
||||
req->r_inode_drop = CEPH_CAP_XATTR_SHARED;
|
||||
req->r_num_caps = 1;
|
||||
req->r_path2 = kstrdup(name, GFP_NOFS);
|
||||
|
||||
err = ceph_mdsc_do_request(mdsc, parent_inode, req);
|
||||
ceph_mdsc_put_request(req);
|
||||
return err;
|
||||
}
|
||||
|
||||
int ceph_removexattr(struct dentry *dentry, const char *name)
|
||||
{
|
||||
struct inode *inode = dentry->d_inode;
|
||||
struct ceph_inode_info *ci = ceph_inode(inode);
|
||||
struct ceph_vxattr_cb *vxattrs = ceph_inode_vxattrs(inode);
|
||||
int issued;
|
||||
int err;
|
||||
|
||||
if (ceph_snap(inode) != CEPH_NOSNAP)
|
||||
return -EROFS;
|
||||
|
||||
if (!ceph_is_valid_xattr(name))
|
||||
return -EOPNOTSUPP;
|
||||
|
||||
if (vxattrs) {
|
||||
struct ceph_vxattr_cb *vxattr =
|
||||
ceph_match_vxattr(vxattrs, name);
|
||||
if (vxattr && vxattr->readonly)
|
||||
return -EOPNOTSUPP;
|
||||
}
|
||||
|
||||
spin_lock(&inode->i_lock);
|
||||
__build_xattrs(inode);
|
||||
issued = __ceph_caps_issued(ci, NULL);
|
||||
dout("removexattr %p issued %s\n", inode, ceph_cap_string(issued));
|
||||
|
||||
if (!(issued & CEPH_CAP_XATTR_EXCL))
|
||||
goto do_sync;
|
||||
|
||||
err = __remove_xattr_by_name(ceph_inode(inode), name);
|
||||
__ceph_mark_dirty_caps(ci, CEPH_CAP_XATTR_EXCL);
|
||||
ci->i_xattrs.dirty = true;
|
||||
inode->i_ctime = CURRENT_TIME;
|
||||
|
||||
spin_unlock(&inode->i_lock);
|
||||
|
||||
return err;
|
||||
do_sync:
|
||||
spin_unlock(&inode->i_lock);
|
||||
err = ceph_send_removexattr(dentry, name);
|
||||
return err;
|
||||
}
|
||||
|