Commit Graph

2157 Commits

Author SHA1 Message Date
Jeff Layton
d3c51ae1b8 ceph: don't clobber i_snap_caps on non-I_NEW inode
We want the snapdir to mirror the non-snapped directory's attributes for
most things, but i_snap_caps represents the caps granted on the snapshot
directory by the MDS itself. A misbehaving MDS could issue different
caps for the snapdir and we lose them here.

Only reset i_snap_caps when the inode is I_NEW. Also, move the setting
of i_op and i_fop inside the if block since they should never change
anyway.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:22 +02:00
Gustavo A. R. Silva
fcaddb1d85 ceph: fix fall-through warnings for Clang
In preparation to enable -Wimplicit-fallthrough for Clang, fix a couple
of warnings by explicitly adding a break and a goto statements instead
of just letting the code fall through to the next case.

URL: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:22 +02:00
Jeff Layton
4987005600 ceph: convert ceph_readpages to ceph_readahead
Convert ceph_readpages to ceph_readahead and make it use
netfs_readahead. With this we can rip out a lot of the old
readpage/readpages infrastructure.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:22 +02:00
Jeff Layton
d801327d95 ceph: convert ceph_write_begin to netfs_write_begin
Convert ceph_write_begin to use the netfs_write_begin helper. Most of
the ops we need for it are already in place from the readpage conversion
but we do add a new check_write_begin op since ceph needs to be able to
vet whether there is an incompatible writeback already in flight before
reading in the page.

With this, we can also remove the old ceph_do_readpage helper.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:22 +02:00
Jeff Layton
f0702876e1 ceph: convert ceph_readpage to netfs_readpage
Have the ceph KConfig select NETFS_SUPPORT. Add a new netfs ops
structure and the operations for it. Convert ceph_readpage to use
the new netfs_readpage helper.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:22 +02:00
Jeff Layton
10a7052c78 ceph: fix fscache invalidation
Ensure that we invalidate the fscache whenever we invalidate the
pagecache.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:22 +02:00
Jeff Layton
7c46b31809 ceph: rework PageFsCache handling
With the new fscache API, the PageFsCache bit now indicates that the
page is being written to the cache and shouldn't be modified or released
until it's finished.

Change releasepage and invalidatepage to wait on that bit before
returning.

Also define FSCACHE_USE_NEW_IO_API so that we opt into the new fscache
API.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:21 +02:00
Jeff Layton
e7df4524cd ceph: rip out old fscache readpage handling
With the new netfs read helper functions, we won't need a lot of this
infrastructure as it handles the pagecache pages itself. Rip out the
read handling for now, and much of the old infrastructure that deals in
individual pages.

The cookie handling is mostly unchanged, however.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:21 +02:00
Jeff Layton
ed94f87c2b ceph: don't allow type or device number to change on non-I_NEW inodes
Al pointed out that a malicious or broken MDS could change the type or
device number of a given inode number. It may also be possible for the
MDS to reuse an old inode number.

Ensure that we never allow fill_inode to change the type part of the
i_mode or the i_rdev unless I_NEW is set. Throw warnings if the MDS ever
changes these on us mid-stream, and return an error.

Don't set i_rdev directly, and rely on init_special_inode to do it.
Also, fix up error handling in the callers of ceph_get_inode.

In handle_cap_grant, check for and warn if the inode type changes, and
only overwrite the mode if it didn't.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-03-08 10:19:37 -05:00
Jeff Layton
3e10a15ffc ceph: fix up error handling with snapdirs
There are several warts in the snapdir error handling. The -EOPNOTSUPP
return in __snapfh_to_dentry is currently lost, and the call to
ceph_handle_snapdir is not currently checked at all.

Fix all of this up and eliminate a BUG_ON in ceph_get_snapdir. We can
handle that case with a warning and return an error.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-03-08 10:19:36 -05:00
Linus Torvalds
7d6beb71da idmapped-mounts-v5.12
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYCegywAKCRCRxhvAZXjc
 ouJ6AQDlf+7jCQlQdeKKoN9QDFfMzG1ooemat36EpRRTONaGuAD8D9A4sUsG4+5f
 4IU5Lj9oY4DEmF8HenbWK2ZHsesL2Qg=
 =yPaw
 -----END PGP SIGNATURE-----

Merge tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull idmapped mounts from Christian Brauner:
 "This introduces idmapped mounts which has been in the making for some
  time. Simply put, different mounts can expose the same file or
  directory with different ownership. This initial implementation comes
  with ports for fat, ext4 and with Christoph's port for xfs with more
  filesystems being actively worked on by independent people and
  maintainers.

  Idmapping mounts handle a wide range of long standing use-cases. Here
  are just a few:

   - Idmapped mounts make it possible to easily share files between
     multiple users or multiple machines especially in complex
     scenarios. For example, idmapped mounts will be used in the
     implementation of portable home directories in
     systemd-homed.service(8) where they allow users to move their home
     directory to an external storage device and use it on multiple
     computers where they are assigned different uids and gids. This
     effectively makes it possible to assign random uids and gids at
     login time.

   - It is possible to share files from the host with unprivileged
     containers without having to change ownership permanently through
     chown(2).

   - It is possible to idmap a container's rootfs and without having to
     mangle every file. For example, Chromebooks use it to share the
     user's Download folder with their unprivileged containers in their
     Linux subsystem.

   - It is possible to share files between containers with
     non-overlapping idmappings.

   - Filesystem that lack a proper concept of ownership such as fat can
     use idmapped mounts to implement discretionary access (DAC)
     permission checking.

   - They allow users to efficiently changing ownership on a per-mount
     basis without having to (recursively) chown(2) all files. In
     contrast to chown (2) changing ownership of large sets of files is
     instantenous with idmapped mounts. This is especially useful when
     ownership of a whole root filesystem of a virtual machine or
     container is changed. With idmapped mounts a single syscall
     mount_setattr syscall will be sufficient to change the ownership of
     all files.

   - Idmapped mounts always take the current ownership into account as
     idmappings specify what a given uid or gid is supposed to be mapped
     to. This contrasts with the chown(2) syscall which cannot by itself
     take the current ownership of the files it changes into account. It
     simply changes the ownership to the specified uid and gid. This is
     especially problematic when recursively chown(2)ing a large set of
     files which is commong with the aforementioned portable home
     directory and container and vm scenario.

   - Idmapped mounts allow to change ownership locally, restricting it
     to specific mounts, and temporarily as the ownership changes only
     apply as long as the mount exists.

  Several userspace projects have either already put up patches and
  pull-requests for this feature or will do so should you decide to pull
  this:

   - systemd: In a wide variety of scenarios but especially right away
     in their implementation of portable home directories.

         https://systemd.io/HOME_DIRECTORY/

   - container runtimes: containerd, runC, LXD:To share data between
     host and unprivileged containers, unprivileged and privileged
     containers, etc. The pull request for idmapped mounts support in
     containerd, the default Kubernetes runtime is already up for quite
     a while now: https://github.com/containerd/containerd/pull/4734

   - The virtio-fs developers and several users have expressed interest
     in using this feature with virtual machines once virtio-fs is
     ported.

   - ChromeOS: Sharing host-directories with unprivileged containers.

  I've tightly synced with all those projects and all of those listed
  here have also expressed their need/desire for this feature on the
  mailing list. For more info on how people use this there's a bunch of
  talks about this too. Here's just two recent ones:

      https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf
      https://fosdem.org/2021/schedule/event/containers_idmap/

  This comes with an extensive xfstests suite covering both ext4 and
  xfs:

      https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts

  It covers truncation, creation, opening, xattrs, vfscaps, setid
  execution, setgid inheritance and more both with idmapped and
  non-idmapped mounts. It already helped to discover an unrelated xfs
  setgid inheritance bug which has since been fixed in mainline. It will
  be sent for inclusion with the xfstests project should you decide to
  merge this.

  In order to support per-mount idmappings vfsmounts are marked with
  user namespaces. The idmapping of the user namespace will be used to
  map the ids of vfs objects when they are accessed through that mount.
  By default all vfsmounts are marked with the initial user namespace.
  The initial user namespace is used to indicate that a mount is not
  idmapped. All operations behave as before and this is verified in the
  testsuite.

  Based on prior discussions we want to attach the whole user namespace
  and not just a dedicated idmapping struct. This allows us to reuse all
  the helpers that already exist for dealing with idmappings instead of
  introducing a whole new range of helpers. In addition, if we decide in
  the future that we are confident enough to enable unprivileged users
  to setup idmapped mounts the permission checking can take into account
  whether the caller is privileged in the user namespace the mount is
  currently marked with.

  The user namespace the mount will be marked with can be specified by
  passing a file descriptor refering to the user namespace as an
  argument to the new mount_setattr() syscall together with the new
  MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern
  of extensibility.

  The following conditions must be met in order to create an idmapped
  mount:

   - The caller must currently have the CAP_SYS_ADMIN capability in the
     user namespace the underlying filesystem has been mounted in.

   - The underlying filesystem must support idmapped mounts.

   - The mount must not already be idmapped. This also implies that the
     idmapping of a mount cannot be altered once it has been idmapped.

   - The mount must be a detached/anonymous mount, i.e. it must have
     been created by calling open_tree() with the OPEN_TREE_CLONE flag
     and it must not already have been visible in the filesystem.

  The last two points guarantee easier semantics for userspace and the
  kernel and make the implementation significantly simpler.

  By default vfsmounts are marked with the initial user namespace and no
  behavioral or performance changes are observed.

  The manpage with a detailed description can be found here:

      1d7b902e28

  In order to support idmapped mounts, filesystems need to be changed
  and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The
  patches to convert individual filesystem are not very large or
  complicated overall as can be seen from the included fat, ext4, and
  xfs ports. Patches for other filesystems are actively worked on and
  will be sent out separately. The xfstestsuite can be used to verify
  that port has been done correctly.

  The mount_setattr() syscall is motivated independent of the idmapped
  mounts patches and it's been around since July 2019. One of the most
  valuable features of the new mount api is the ability to perform
  mounts based on file descriptors only.

  Together with the lookup restrictions available in the openat2()
  RESOLVE_* flag namespace which we added in v5.6 this is the first time
  we are close to hardened and race-free (e.g. symlinks) mounting and
  path resolution.

  While userspace has started porting to the new mount api to mount
  proper filesystems and create new bind-mounts it is currently not
  possible to change mount options of an already existing bind mount in
  the new mount api since the mount_setattr() syscall is missing.

  With the addition of the mount_setattr() syscall we remove this last
  restriction and userspace can now fully port to the new mount api,
  covering every use-case the old mount api could. We also add the
  crucial ability to recursively change mount options for a whole mount
  tree, both removing and adding mount options at the same time. This
  syscall has been requested multiple times by various people and
  projects.

  There is a simple tool available at

      https://github.com/brauner/mount-idmapped

  that allows to create idmapped mounts so people can play with this
  patch series. I'll add support for the regular mount binary should you
  decide to pull this in the following weeks:

  Here's an example to a simple idmapped mount of another user's home
  directory:

	u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt

	u1001@f2-vm:/$ ls -al /home/ubuntu/
	total 28
	drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
	drwxr-xr-x 4 root   root   4096 Oct 28 04:00 ..
	-rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
	-rw-r--r-- 1 ubuntu ubuntu  220 Feb 25  2020 .bash_logout
	-rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25  2020 .bashrc
	-rw-r--r-- 1 ubuntu ubuntu  807 Feb 25  2020 .profile
	-rw-r--r-- 1 ubuntu ubuntu    0 Oct 16 16:11 .sudo_as_admin_successful
	-rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo

	u1001@f2-vm:/$ ls -al /mnt/
	total 28
	drwxr-xr-x  2 u1001 u1001 4096 Oct 28 22:07 .
	drwxr-xr-x 29 root  root  4096 Oct 28 22:01 ..
	-rw-------  1 u1001 u1001 3154 Oct 28 22:12 .bash_history
	-rw-r--r--  1 u1001 u1001  220 Feb 25  2020 .bash_logout
	-rw-r--r--  1 u1001 u1001 3771 Feb 25  2020 .bashrc
	-rw-r--r--  1 u1001 u1001  807 Feb 25  2020 .profile
	-rw-r--r--  1 u1001 u1001    0 Oct 16 16:11 .sudo_as_admin_successful
	-rw-------  1 u1001 u1001 1144 Oct 28 00:43 .viminfo

	u1001@f2-vm:/$ touch /mnt/my-file

	u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file

	u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file

	u1001@f2-vm:/$ ls -al /mnt/my-file
	-rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file

	u1001@f2-vm:/$ ls -al /home/ubuntu/my-file
	-rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file

	u1001@f2-vm:/$ getfacl /mnt/my-file
	getfacl: Removing leading '/' from absolute path names
	# file: mnt/my-file
	# owner: u1001
	# group: u1001
	user::rw-
	user:u1001:rwx
	group::rw-
	mask::rwx
	other::r--

	u1001@f2-vm:/$ getfacl /home/ubuntu/my-file
	getfacl: Removing leading '/' from absolute path names
	# file: home/ubuntu/my-file
	# owner: ubuntu
	# group: ubuntu
	user::rw-
	user:ubuntu:rwx
	group::rw-
	mask::rwx
	other::r--"

* tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits)
  xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl
  xfs: support idmapped mounts
  ext4: support idmapped mounts
  fat: handle idmapped mounts
  tests: add mount_setattr() selftests
  fs: introduce MOUNT_ATTR_IDMAP
  fs: add mount_setattr()
  fs: add attr_flags_to_mnt_flags helper
  fs: split out functions to hold writers
  namespace: only take read lock in do_reconfigure_mnt()
  mount: make {lock,unlock}_mount_hash() static
  namespace: take lock_mount_hash() directly when changing flags
  nfs: do not export idmapped mounts
  overlayfs: do not mount on top of idmapped mounts
  ecryptfs: do not mount on top of idmapped mounts
  ima: handle idmapped mounts
  apparmor: handle idmapped mounts
  fs: make helpers idmap mount aware
  exec: handle idmapped mounts
  would_dump: handle idmapped mounts
  ...
2021-02-23 13:39:45 -08:00
Xiubo Li
558b4510f6 ceph: defer flushing the capsnap if the Fb is used
If the Fb cap is used it means the current inode is flushing the
dirty data to OSD, just defer flushing the capsnap.

URL: https://tracker.ceph.com/issues/48640
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-02-16 12:09:52 +01:00
Jeff Layton
a8810cdc00 ceph: allow queueing cap/snap handling after putting cap references
Testing with the fscache overhaul has triggered some lockdep warnings
about circular lock dependencies involving page_mkwrite and the
mmap_lock. It'd be better to do the "real work" without the mmap lock
being held.

Change the skip_checking_caps parameter in __ceph_put_cap_refs to an
enum, and use that to determine whether to queue check_caps, do it
synchronously or not at all. Change ceph_page_mkwrite to do a
ceph_put_cap_refs_async().

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-02-16 12:09:51 +01:00
Jeff Layton
64f28c627a ceph: clean up inode work queueing
Add a generic function for taking an inode reference, setting the I_WORK
bit and queueing i_work. Turn the ceph_queue_* functions into static
inline wrappers that pass in the right bit.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-02-16 12:09:51 +01:00
Jeff Layton
64f36da562 ceph: fix flush_snap logic after putting caps
A primary reason for skipping ceph_check_caps after putting the
references was to avoid the locking in ceph_check_caps during a
reconnect. __ceph_put_cap_refs can still call ceph_flush_snaps in that
case though, and that takes many of the same inconvenient locks.

Fix the logic in __ceph_put_cap_refs to skip flushing snaps when the
skip_checking_caps flag is set.

Fixes: e64f44a884 ("ceph: skip checking caps when session reconnecting and releasing reqs")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-02-16 12:09:51 +01:00
Christian Brauner
549c729771
fs: make helpers idmap mount aware
Extend some inode methods with an additional user namespace argument. A
filesystem that is aware of idmapped mounts will receive the user
namespace the mount has been marked with. This can be used for
additional permission checking and also to enable filesystems to
translate between uids and gids if they need to. We have implemented all
relevant helpers in earlier patches.

As requested we simply extend the exisiting inode method instead of
introducing new ones. This is a little more code churn but it's mostly
mechanical and doesnt't leave us with additional inode methods.

Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:20 +01:00
Christian Brauner
0d56a4518d
stat: handle idmapped mounts
The generic_fillattr() helper fills in the basic attributes associated
with an inode. Enable it to handle idmapped mounts. If the inode is
accessed through an idmapped mount map it into the mount's user
namespace before we store the uid and gid. If the initial user namespace
is passed nothing changes so non-idmapped mounts will see identical
behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-12-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:17 +01:00
Christian Brauner
e65ce2a50c
acl: handle idmapped mounts
The posix acl permission checking helpers determine whether a caller is
privileged over an inode according to the acls associated with the
inode. Add helpers that make it possible to handle acls on idmapped
mounts.

The vfs and the filesystems targeted by this first iteration make use of
posix_acl_fix_xattr_from_user() and posix_acl_fix_xattr_to_user() to
translate basic posix access and default permissions such as the
ACL_USER and ACL_GROUP type according to the initial user namespace (or
the superblock's user namespace) to and from the caller's current user
namespace. Adapt these two helpers to handle idmapped mounts whereby we
either map from or into the mount's user namespace depending on in which
direction we're translating.
Similarly, cap_convert_nscap() is used by the vfs to translate user
namespace and non-user namespace aware filesystem capabilities from the
superblock's user namespace to the caller's user namespace. Enable it to
handle idmapped mounts by accounting for the mount's user namespace.

In addition the fileystems targeted in the first iteration of this patch
series make use of the posix_acl_chmod() and, posix_acl_update_mode()
helpers. Both helpers perform permission checks on the target inode. Let
them handle idmapped mounts. These two helpers are called when posix
acls are set by the respective filesystems to handle this case we extend
the ->set() method to take an additional user namespace argument to pass
the mount's user namespace down.

Link: https://lore.kernel.org/r/20210121131959.646623-9-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:17 +01:00
Christian Brauner
2f221d6f7b
attr: handle idmapped mounts
When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:16 +01:00
Christian Brauner
47291baa8d
namei: make permission helpers idmapped mount aware
The two helpers inode_permission() and generic_permission() are used by
the vfs to perform basic permission checking by verifying that the
caller is privileged over an inode. In order to handle idmapped mounts
we extend the two helpers with an additional user namespace argument.
On idmapped mounts the two helpers will make sure to map the inode
according to the mount's user namespace and then peform identical
permission checks to inode_permission() and generic_permission(). If the
initial user namespace is passed nothing changes so non-idmapped mounts
will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:16 +01:00
Ilya Dryomov
4972cf605f libceph, ceph: disambiguate ceph_connection_operations handlers
Since a few years, kernel addresses are no longer included in oops
dumps, at least on x86.  All we get is a symbol name with offset and
size.

This is a problem for ceph_connection_operations handlers, especially
con->ops->dispatch().  All three handlers have the same name and there
is little context to disambiguate between e.g. monitor and OSD clients
because almost everything is inlined.  gdb sneakily stops at the first
matching symbol, so one has to resort to nm and addr2line.

Some of these are already prefixed with mon_, osd_ or mds_.  Let's do
the same for all others.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
2021-01-04 17:31:32 +01:00
Ilya Dryomov
60267ba35c ceph: reencode gid_list when reconnecting
On reconnect, cap and dentry releases are dropped and the fields
that follow must be reencoded into the freed space.  Currently these
are timestamp and gid_list, but gid_list isn't reencoded.  This
results in

  failed to decode message of type 24 v4: End of buffer

errors on the MDS.

While at it, make a change to encode gid_list unconditionally,
without regard to what head/which version was used as a result
of checking whether CEPH_FEATURE_FS_BTIME is supported or not.

URL: https://tracker.ceph.com/issues/48618
Fixes: 4f1ddb1ea8 ("ceph: implement updated ceph_mds_request_head structure")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2020-12-28 20:34:32 +01:00
Ilya Dryomov
ce287162d9 libceph, ceph: make use of __ceph_auth_get_authorizer() in msgr1
This shouldn't cause any functional changes.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:50 +01:00
Ilya Dryomov
cd1a677cad libceph, ceph: implement msgr2.1 protocol (crc and secure modes)
Implement msgr2.1 wire protocol, available since nautilus 14.2.11
and octopus 15.2.5.  msgr2.0 wire protocol is not implemented -- it
has several security, integrity and robustness issues and therefore
considered deprecated.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:50 +01:00
Ilya Dryomov
a5cbd5fc22 libceph, ceph: get and handle cluster maps with addrvecs
In preparation for msgr2, make the cluster send us maps with addrvecs
including both LEGACY and MSGR2 addrs instead of a single LEGACY addr.
This means advertising support for SERVER_NAUTILUS and also some older
features: SERVER_MIMIC, MONENC and MONNAMES.

MONNAMES and MONENC are actually pre-argonaut, we just never updated
ceph_monmap_decode() for them.  Decoding is unconditional, see commit
23c625ce30 ("libceph: assume argonaut on the server side").

SERVER_MIMIC doesn't bear any meaning for the kernel client.

Since ceph_decode_entity_addrvec() is guarded by encoding version
checks (and in msgr2 case it is guarded implicitly by the fact that
server is speaking msgr2), we assume MSG_ADDR2 for it.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:50 +01:00
Ilya Dryomov
285ea34fc8 libceph, ceph: incorporate nautilus cephx changes
- request service tickets together with auth ticket.  Currently we get
  auth ticket via CEPHX_GET_AUTH_SESSION_KEY op and then request service
  tickets via CEPHX_GET_PRINCIPAL_SESSION_KEY op in a separate message.
  Since nautilus, desired service tickets are shared togther with auth
  ticket in CEPHX_GET_AUTH_SESSION_KEY reply.

- propagate session key and connection secret, if any.  In preparation
  for msgr2, update handle_reply() and verify_authorizer_reply() auth
  ops to propagate session key and connection secret.  Since nautilus,
  if secure mode is negotiated, connection secret is shared either in
  CEPHX_GET_AUTH_SESSION_KEY reply (for mons) or in a final authorizer
  reply (for osds and mdses).

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:50 +01:00
Jeff Layton
4f1ddb1ea8 ceph: implement updated ceph_mds_request_head structure
When we added the btime feature in mainline ceph, we had to extend
struct ceph_mds_request_args so that it could be set. Implement the same
in the kernel client.

Rename ceph_mds_request_head with a _old extension, and a union
ceph_mds_request_args_ext to allow for the extended size of the new
header format.

Add the appropriate code to handle both formats in struct
create_request_message and key the behavior on whether the peer supports
CEPH_FEATURE_FS_BTIME.

The gid_list field in the payload is now populated from the saved
credential. For now, we don't add any support for setting the btime via
setattr, but this does enable us to add that in the future.

[ idryomov: break unnecessarily long lines ]

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:48 +01:00
Jeff Layton
396bd62c69 ceph: clean up argument lists to __prepare_send_request and __send_request
We can always get the mdsc from the session, so there's no need to pass
it in as a separate argument. Pass the session to __prepare_send_request
as well, to prepare for later patches that will need to access it.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:48 +01:00
Jeff Layton
7fe0cdeb0f ceph: take a cred reference instead of tracking individual uid/gid
Replace req->r_uid/r_gid with an r_cred pointer and take a reference to
that at the point where we previously would sample the two.  Use that to
populate the uid and gid in the header and release the reference when
the request is freed.

This should enable us to later add support for sending supplementary
group lists in MDS requests.

[ idryomov: break unnecessarily long lines ]

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:48 +01:00
Jeff Layton
0f51a98361 ceph: don't reach into request header for readdir info
We already have a pointer to the argument struct in req->r_args. Use that
instead of groveling around in the ceph_mds_request_head.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:48 +01:00
Xiubo Li
968cd14edc ceph: set osdmap epoch for setxattr
When setting the file/dir layout, it may need data pool info. So
in mds server, it needs to check the osdmap. At present, if mds
doesn't find the data pool specified, it will try to get the latest
osdmap. Now if pass the osd epoch for setxattr, the mds server can
only check this epoch of osdmap.

URL: https://tracker.ceph.com/issues/48504
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:48 +01:00
Colin Ian King
4a756db2a1 ceph: remove redundant assignment to variable i
The variable i is being initialized with a value that is never read
and it is being updated later with a new value in a for-loop.  The
initialization is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:48 +01:00
Luis Henriques
dd980fc0d5 ceph: add ceph.caps vxattr
Add a new vxattr that allows userspace to list the caps for a specific
directory or file.

[ jlayton: change format delimiter to '/' ]

Signed-off-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:48 +01:00
Jeff Layton
bca9fc14c7 ceph: when filling trace, call ceph_get_inode outside of mutexes
Geng Jichao reported a rather complex deadlock involving several
moving parts:

1) readahead is issued against an inode and some of its pages are locked
   while the read is in flight

2) the same inode is evicted from the cache, and this task gets stuck
   waiting for the page lock because of the above readahead

3) another task is processing a reply trace, and looks up the inode
   being evicted while holding the s_mutex. That ends up waiting for the
   eviction to complete

4) a write reply for an unrelated inode is then processed in the
   ceph_con_workfn job. It calls ceph_check_caps after putting wrbuffer
   caps, and that gets stuck waiting on the s_mutex held by 3.

The reply to "1" is stuck behind the write reply in "4", so we deadlock
at that point.

This patch changes the trace processing to call ceph_get_inode outside
of the s_mutex and snap_rwsem, which should break the cycle above.

[ idryomov: break unnecessarily long lines ]

URL: https://tracker.ceph.com/issues/47998
Reported-by: Geng Jichao <gengjichao@jd.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:48 +01:00
Luis Henriques
6646ea1c8e Revert "ceph: allow rename operation under different quota realms"
This reverts commit dffdcd7145.

When doing a rename across quota realms, there's a corner case that isn't
handled correctly.  Here's a testcase:

  mkdir files limit
  truncate files/file -s 10G
  setfattr limit -n ceph.quota.max_bytes -v 1000000
  mv files limit/

The above will succeed because ftruncate(2) won't immediately notify the
MDSs with the new file size, and thus the quota realms stats won't be
updated.

Since the possible fixes for this issue would have a huge performance impact,
the solution for now is to simply revert to returning -EXDEV when doing a cross
quota realms rename.

URL: https://tracker.ceph.com/issues/48203
Signed-off-by: Luis Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:47 +01:00
Jeff Layton
68cbb8056a ceph: fix inode refcount leak when ceph_fill_inode on non-I_NEW inode fails
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:47 +01:00
Luis Henriques
ccd1acdf1c ceph: downgrade warning from mdsmap decode to debug
While the MDS cluster is unstable and changing state the client may get
mdsmap updates that will trigger warnings:

  [144692.478400] ceph: mdsmap_decode got incorrect state(up:standby-replay)
  [144697.489552] ceph: mdsmap_decode got incorrect state(up:standby-replay)
  [144697.489580] ceph: mdsmap_decode got incorrect state(up:standby-replay)

This patch downgrades these warnings to debug, as they may flood the logs
if the cluster is unstable for a while.

Signed-off-by: Luis Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:47 +01:00
Luis Henriques
e5cafce3ad ceph: fix race in concurrent __ceph_remove_cap invocations
A NULL pointer dereference may occur in __ceph_remove_cap with some of the
callbacks used in ceph_iterate_session_caps, namely trim_caps_cb and
remove_session_caps_cb. Those callers hold the session->s_mutex, so they
are prevented from concurrent execution, but ceph_evict_inode does not.

Since the callers of this function hold the i_ceph_lock, the fix is simply
a matter of returning immediately if caps->ci is NULL.

Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/43272
Suggested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Luis Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:47 +01:00
Jeff Layton
4a357f5069 ceph: pass down the flags to grab_cache_page_write_begin
write_begin operations are passed a flags parameter that we need to
mirror here, so that we don't (e.g.) recurse back into filesystem code
inappropriately.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:47 +01:00
Xiubo Li
5a9e2f5d55 ceph: add ceph.{cluster_fsid/client_id} vxattrs
These two vxattrs will only exist in local client side, with which
we can easily know which mountpoint the file belongs to and also
they can help locate the debugfs path quickly.

URL: https://tracker.ceph.com/issues/48057
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:47 +01:00
Xiubo Li
247b1f19db ceph: add status debugfs file
This will help list some useful client side info, like the client
entity address/name and blocklisted status, etc.

URL: https://tracker.ceph.com/issues/48057
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:47 +01:00
Jeff Layton
04fabb1199 ceph: ensure we have Fs caps when fetching dir link count
The link count for a directory is defined as inode->i_subdirs + 2,
(for "." and ".."). i_subdirs is only populated when Fs caps are held.
Ensure we grab Fs caps when fetching the link count for a directory.

[ idryomov: break unnecessarily long line ]

URL: https://tracker.ceph.com/issues/48125
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:47 +01:00
Xiubo Li
8ba3b8c7fb ceph: send dentry lease metrics to MDS daemon
For the old ceph version, if it received this one metric message
containing the dentry lease metric info, it will just ignore it.

URL: https://tracker.ceph.com/issues/43423
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:47 +01:00
Jeff Layton
81048c00d1 ceph: acquire Fs caps when getting dir stats
We only update the inode's dirstats when we have Fs caps from the MDS.

Declare a new VXATTR_FLAG_DIRSTAT that we set on all dirstats, and have
the vxattr handling code acquire those caps when it's set.

URL: https://tracker.ceph.com/issues/48104
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:47 +01:00
Jeff Layton
06a1ad438b ceph: fix up some warnings on W=1 builds
Convert some decodes into unused variables into skips, and fix up some
non-kerneldoc comment headers to not start with "/**".

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:47 +01:00
Jeff Layton
4ae3713fe4 ceph: queue MDS requests to REJECTED sessions when CLEANRECOVER is set
Ilya noticed that the first access to a blacklisted mount would often
get back -EACCES, but then subsequent calls would be OK. The problem is
in __do_request. If the session is marked as REJECTED, a hard error is
returned instead of waiting for a new session to come into being.

When the session is REJECTED and the mount was done with
recover_session=clean, queue the request to the waiting_for_map queue,
which will be awoken after tearing down the old session. We can only
do this for sync requests though, so check for async ones first and
just let the callers redrive a sync request.

URL: https://tracker.ceph.com/issues/47385
Reported-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:47 +01:00
Jeff Layton
dbeec07bc8 ceph: remove timeout on allowing reconnect after blocklisting
30 minutes is a long time to wait, and this makes it difficult to test
the feature by manually blocklisting clients. Remove the timeout
infrastructure and just allow the client to reconnect at will.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:46 +01:00
Jeff Layton
50c9132ddf ceph: add new RECOVER mount_state when recovering session
When recovering a session (a'la recover_session=clean), we want to do
all of the operations that we do on a forced umount, but changing the
mount state to SHUTDOWN is can cause queued MDS requests to fail when
the session comes back. Most of those can idle until the session is
recovered in this situation.

Reserve SHUTDOWN state for forced umount, and make a new RECOVER state
for the forced reconnect situation. Change several tests for equality with
SHUTDOWN to test for that or RECOVER.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:46 +01:00
Jeff Layton
aa5c791053 ceph: make fsc->mount_state an int
This field is an unsigned long currently, which is a bit of a waste on
most arches since this just holds an enum. Make it (signed) int instead.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:46 +01:00
Jeff Layton
dc167e38a0 ceph: don't WARN when removing caps due to blocklisting
We expect to remove dirty caps when the client is blocklisted. Don't
throw a warning in that case.

[ idryomov: break unnecessarily long line ]

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:46 +01:00
Jeff Layton
62575e270f ceph: check session state after bumping session->s_seq
Some messages sent by the MDS entail a session sequence number
increment, and the MDS will drop certain types of requests on the floor
when the sequence numbers don't match.

In particular, a REQUEST_CLOSE message can cross with one of the
sequence morphing messages from the MDS which can cause the client to
stall, waiting for a response that will never come.

Originally, this meant an up to 5s delay before the recurring workqueue
job kicked in and resent the request, but a recent change made it so
that the client would never resend, causing a 60s stall unmounting and
sometimes a blockisting event.

Add a new helper for incrementing the session sequence and then testing
to see whether a REQUEST_CLOSE needs to be resent, and move the handling
of CEPH_MDS_SESSION_CLOSING into that function. Change all of the
bare sequence counter increments to use the new helper.

Reorganize check_session_state with a switch statement.  It should no
longer be called when the session is CLOSING, so throw a warning if it
ever is (but still handle that case sanely).

[ idryomov: whitespace, pr_err() call fixup ]

URL: https://tracker.ceph.com/issues/47563
Fixes: fa99677342 ("ceph: fix potential mdsc use-after-free crash")
Reported-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-11-04 20:55:49 +01:00
Linus Torvalds
0eac1102e9 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted stuff all over the place (the largest group here is
  Christoph's stat cleanups)"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: remove KSTAT_QUERY_FLAGS
  fs: remove vfs_stat_set_lookup_flags
  fs: move vfs_fstatat out of line
  fs: implement vfs_stat and vfs_lstat in terms of vfs_fstatat
  fs: remove vfs_statx_fd
  fs: omfs: use kmemdup() rather than kmalloc+memcpy
  [PATCH] reduce boilerplate in fsid handling
  fs: Remove duplicated flag O_NDELAY occurring twice in VALID_OPEN_FLAGS
  selftests: mount: add nosymfollow tests
  Add a "nosymfollow" mount option.
2020-10-24 12:26:05 -07:00
Jeff Layton
c74d79af90 ceph: comment cleanups and clarifications
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:27 +02:00
Jeff Layton
16d68903f5 ceph: break up send_cap_msg
Push the allocation of the msg and the send into the caller. Rename
the function to encode_cap_msg and make it void return.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:27 +02:00
Jeff Layton
5231198089 ceph: drop separate mdsc argument from __send_cap
We can get it from the session if we need it.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:27 +02:00
Matthew Wilcox (Oracle)
c403c3a2fb ceph: promote to unsigned long long before shifting
On 32-bit systems, this shift will overflow for files larger than 4GB.

Cc: stable@vger.kernel.org
Fixes: 61f6881621 ("ceph: check caps in filemap_fault and page_mkwrite")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:27 +02:00
Jeff Layton
7edf1ec5b2 ceph: don't SetPageError on readpage errors
PageError really only has meaning within a particular subsystem. Nothing
looks at this bit in the core kernel code, and ceph itself doesn't care
about it. Don't bother setting the PageError bit on error.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:27 +02:00
Ilya Dryomov
f6fbdcd997 ceph: mark ceph_fmt_xattr() as printf-like for better type checking
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:27 +02:00
Jeff Layton
1cc1699070 ceph: fold ceph_update_writeable_page into ceph_write_begin
...and reorganize the loop for better clarity.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:27 +02:00
Jeff Layton
6390987f2f ceph: fold ceph_sync_writepages into writepage_nounlock
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:27 +02:00
Jeff Layton
9b4862ecae ceph: fold ceph_sync_readpages into ceph_readpage
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:27 +02:00
Jeff Layton
d45156bf46 ceph: don't call ceph_update_writeable_page from page_mkwrite
page_mkwrite should only be called with Uptodate pages, so we should
only need to flush incompatible snap contexts.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:27 +02:00
Jeff Layton
18d620f063 ceph: break out writeback of incompatible snap context to separate function
When dirtying a page, we have to flush incompatible contexts. Move the
search for an incompatible context into a separate function, and fix up
the caller to wait and retry if there is one.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:27 +02:00
Ilya Dryomov
4bb926e83f ceph: add a note explaining session reject error string
error_string key in the metadata map of MClientSession message
is intended for humans, but unfortunately became part of the on-wire
format with the introduction of recover_session=clean mode in commit
131d7eb4fa ("ceph: auto reconnect after blacklisted").

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:26 +02:00
Ilya Dryomov
0b98acd618 libceph, rbd, ceph: "blacklist" -> "blocklist"
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:26 +02:00
Jeff Layton
2e16929660 ceph: have ceph_writepages_start call pagevec_lookup_range_tag
Currently it calls pagevec_lookup_range_nr_tag(), but that may be
inefficient, as we might end up having to search several times as we get
down to looking for fewer pages to fill the array.

Thus spake Willy:

"I think ceph is misusing pagevec_lookup_range_nr_tag().  Let's suppose
 you get a range which is AAAAbbbbAAAAbbbbAAAAbbbbbbbb(...)bbbbAAAA and
 you try to fetch max_pages=13.  First loop will get AAAAbbbbAAAAb and
 have 8 locked_pages.  The next call will get bbbAA and now
 locked_pages=10.  Next call gets AAb ... and now you're iterating your
 way through all the 'b' one page at a time until you find that first A."

'A' here refers to pages that are eligible for writeback and 'b'
represents ones that aren't (for whatever reason).

Not capping the number of return pages may mean that we sometimes find
more pages than are needed, but the extra references will just get put
at the end.

Ceph is also the only caller of pagevec_lookup_range_nr_tag(), so this
change should allow us to eliminate that call as well. That will be done
in a follow-on patch.

Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:26 +02:00
Jeff Layton
470a5c77ea ceph: use kill_anon_super helper
ceph open-codes this around some other activity and the rationale
for it isn't clear. There is no need to delay free_anon_bdev until
the end of kill_sb.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:26 +02:00
Xiubo Li
1dd8d47081 ceph: metrics for opened files, pinned caps and opened inodes
In client for each inode, it may have many opened files and may
have been pinned in more than one MDS servers. And some inodes
are idle, which have no any opened files.

This patch will show these metrics in the debugfs, likes:

item                               total
-----------------------------------------
opened files  / total inodes       14 / 5
pinned i_caps / total inodes       7  / 5
opened inodes / total inodes       3  / 5

Will send these metrics to ceph, which will be used by the `fs top`,
later.

[ jlayton: drop unrelated hunk, count hashed inodes instead of
           allocated ones ]

URL: https://tracker.ceph.com/issues/47005
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:26 +02:00
Xiubo Li
2678da88f4 ceph: add ceph_sb_to_mdsc helper support to parse the mdsc
This will help simplify the code.

[ jlayton: fix minor merge conflict in quota.c ]

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:26 +02:00
Jeff Layton
c5f575ed08 ceph: drop special-casing for ITER_PIPE in ceph_sync_read
This special casing was added in 7ce469a53e (ceph: fix splice
read for no Fc capability case). The confirm callback for ITER_PIPE
expects that the page is Uptodate and returns an error otherwise.

A simpler workaround is just to use the Uptodate bit, which has no
meaning for anonymous pages. Rip out the special casing for ITER_PIPE
and just SetPageUptodate before we copy to the iter.

Cc: John Hubbard <jhubbard@nvidia.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:26 +02:00
Yanhu Cao
3a8ebe0b8b ceph: add column 'mds' to show caps in more user friendly
In multi-mds, the 'caps' debugfs file will have duplicate ino,
add the 'mds' column to indicate which mds session the cap belongs to.

Signed-off-by: Yanhu Cao <gmayyyha@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:26 +02:00
Luis Henriques
1c30c90733 ceph: remove unnecessary return in switch statement
Since there's a return immediately after the 'break', there's no need for
this extra 'return' in the S_IFDIR case.

Signed-off-by: Luis Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:26 +02:00
Yan, Zheng
a33f6432b3 ceph: encode inodes' parent/d_name in cap reconnect message
Since nautilus, MDS tracks dirfrags whose child inodes have caps in open
file table. When MDS recovers, it prefetches all of these dirfrags. This
avoids using backtrace to load inodes. But dirfrags prefetch may load
lots of useless inodes into cache, and make MDS run out of memory.

Recent MDS adds an option that disables dirfrags prefetch. When dirfrags
prefetch is disabled. Recovering MDS only prefetches corresponding dir
inodes. Including inodes' parent/d_name in cap reconnect message can
help MDS to load inodes into its cache.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:25 +02:00
Al Viro
6d1349c769 [PATCH] reduce boilerplate in fsid handling
Get rid of boilerplate in most of ->statfs()
instances...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-09-18 16:45:50 -04:00
Linus Torvalds
b0bfd5eca9 We have an inode number handling change, prompted by s390x which is
a 64-bit architecture with a 32-bit ino_t, a patch to disallow leases
 to avoid potential data integrity issues when CephFS is re-exported
 via NFS or CIFS and a fix for the bulk of W=1 compilation warnings.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl9I9UITHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi5HcB/0XKCF9pJmHM7vyMpZvALaNn97B7jgD
 o4Lq0RBnIyum0UltnxvrROO1LGyurFH0hWofRykEWXy8Qt/9fZTddClOmY+6W+7P
 A1MVInSxCXpNHj5y1uMuwhWkLmuNWnW1aKJXBn8tXKsmrYQM3SmbCMRx8a2GxJk9
 mxl9zAtTRsih0AovRbde93i4FMpeXsyDk9EcyiJFcgnDjJpCAXgN9smu9xjPskOI
 rrFQBjsrYB04FVuv5lEB/xZI/2QLM2FqzlpIRsa6udtYRsDnOtgIbLq8p0sNJnC6
 QFxDaWFUJAXN/b1UWZz8yw2Y7nF7r47Hg6fDM4PBQ09PQEbIQKY+qn3R
 =6+bm
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.9-rc3' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "We have an inode number handling change, prompted by s390x which is a
  64-bit architecture with a 32-bit ino_t, a patch to disallow leases to
  avoid potential data integrity issues when CephFS is re-exported via
  NFS or CIFS and a fix for the bulk of W=1 compilation warnings"

* tag 'ceph-for-5.9-rc3' of git://github.com/ceph/ceph-client:
  ceph: don't allow setlease on cephfs
  ceph: fix inode number handling on arches with 32-bit ino_t
  libceph: add __maybe_unused to DEFINE_CEPH_FEATURE
2020-08-28 10:33:04 -07:00
Jeff Layton
496ceaf124 ceph: don't allow setlease on cephfs
Leases don't currently work correctly on kcephfs, as they are not broken
when caps are revoked. They could eventually be implemented similarly to
how we did them in libcephfs, but for now don't allow them.

[ idryomov: no need for simple_nosetlease() in ceph_dir_fops and
  ceph_snapdir_fops ]

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-08-24 20:06:54 +02:00
Jeff Layton
ebce3eb2f7 ceph: fix inode number handling on arches with 32-bit ino_t
Tuan and Ulrich mentioned that they were hitting a problem on s390x,
which has a 32-bit ino_t value, even though it's a 64-bit arch (for
historical reasons).

I think the current handling of inode numbers in the ceph driver is
wrong. It tries to use 32-bit inode numbers on 32-bit arches, but that's
actually not a problem. 32-bit arches can deal with 64-bit inode numbers
just fine when userland code is compiled with LFS support (the common
case these days).

What we really want to do is just use 64-bit numbers everywhere, unless
someone has mounted with the ino32 mount option. In that case, we want
to ensure that we hash the inode number down to something that will fit
in 32 bits before presenting the value to userland.

Add new helper functions that do this, and only do the conversion before
presenting these values to userland in getattr and readdir.

The inode table hashvalue is changed to just cast the inode number to
unsigned long, as low-order bits are the most likely to vary anyway.

While it's not strictly required, we do want to put something in
inode->i_ino. Instead of basing it on BITS_PER_LONG, however, base it on
the size of the ino_t type.

NOTE: This is a user-visible change on 32-bit arches:

1/ inode numbers will be seen to have changed between kernel versions.
   32-bit arches will see large inode numbers now instead of the hashed
   ones they saw before.

2/ any really old software not built with LFS support may start failing
   stat() calls with -EOVERFLOW on inode numbers >2^32. Nothing much we
   can do about these, but hopefully the intersection of people running
   such code on ceph will be very small.

The workaround for both problems is to mount with "-o ino32".

[ idryomov: changelog tweak ]

URL: https://tracker.ceph.com/issues/46828
Reported-by: Ulrich Weigand <Ulrich.Weigand@de.ibm.com>
Reported-and-Tested-by: Tuan Hoang1 <Tuan.Hoang1@ibm.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-08-24 17:25:26 +02:00
Gustavo A. R. Silva
df561f6688 treewide: Use fallthrough pseudo-keyword
Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-23 17:36:59 -05:00
Jeff Layton
02e37571f9 ceph: handle zero-length feature mask in session messages
Most session messages contain a feature mask, but the MDS will
routinely send a REJECT message with one that is zero-length.

Commit 0fa8263367 ("ceph: fix endianness bug when handling MDS
session feature bits") fixed the decoding of the feature mask,
but failed to account for the MDS sending a zero-length feature
mask. This causes REJECT message decoding to fail.

Skip trying to decode a feature mask if the word count is zero.

Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/46823
Fixes: 0fa8263367 ("ceph: fix endianness bug when handling MDS session feature bits")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-08-05 17:47:07 +02:00
Yanhu Cao
224c7b6778 ceph: use frag's MDS in either mode
When doing some tests with multiple mds, we were seeing many mds
forwarding requests between them, causing clients to resend.

If the request is a modification operation and the mode is set to
USE_AUTH_MDS, then the auth mds should be selected to handle the
request. If auth mds for frag is already set, then it should be returned
directly without further processing.

The current logic is wrong because it only returns directly if
mode is USE_AUTH_MDS, but we want to do that for all modes. If we don't,
then when the frag's mds is not equal to cap session's mds, the request
will get sent to the wrong MDS needlessly.

Drop the mode check in this condition.

Signed-off-by: Yanhu Cao <gmayyyha@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-08-04 19:41:14 +02:00
Jeff Layton
a0102bda5b ceph: move sb->wb_pagevec_pool to be a global mempool
When doing some testing recently, I hit some page allocation failures
on mount, when creating the wb_pagevec_pool for the mount. That
requires 128k (32 contiguous pages), and after thrashing the memory
during an xfstests run, sometimes that would fail.

128k for each mount seems like a lot to hold in reserve for a rainy
day, so let's change this to a global mempool that gets allocated
when the module is plugged in.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-08-04 19:41:12 +02:00
Jeff Layton
b748fc7a87 ceph: set sec_context xattr on symlink creation
Symlink inodes should have the security context set in their xattrs on
creation. We already set the context on creation, but we don't attach
the pagelist. The effect is that symlink inodes don't get an SELinux
context set on them at creation, so they end up unlabeled instead of
inheriting the proper context. Make it do so.

Cc: stable@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-08-04 19:41:11 +02:00
Colin Ian King
2c81ef286c ceph: remove redundant initialization of variable mds
The variable mds is being initialized with a value that is never read
and it is being updated later with a new value.  The initialization is
redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-08-03 11:05:28 +02:00
Xiubo Li
a7caa88f8b ceph: fix use-after-free for fsc->mdsc
If the ceph_mdsc_init() fails, it will free the mdsc already.

Reported-by: syzbot+b57f46d8d6ea51960b8c@syzkaller.appspotmail.com
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-08-03 11:05:27 +02:00
Jia Yang
8e298deb8d ceph: remove unused variables in ceph_mdsmap_decode()
Fix build warnings:

  fs/ceph/mdsmap.c: In function ‘ceph_mdsmap_decode’:
  fs/ceph/mdsmap.c:192:7: warning: variable ‘info_cv’ set but not used [-Wunused-but-set-variable]
  fs/ceph/mdsmap.c:177:7: warning: variable ‘state_seq’ set but not used [-Wunused-but-set-variable]
  fs/ceph/mdsmap.c:123:15: warning: variable ‘mdsmap_cv’ set but not used [-Wunused-but-set-variable]

Note that p is increased in ceph_decode_*.

Signed-off-by: Jia Yang <jiayang5@huawei.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-08-03 11:05:27 +02:00
Randy Dunlap
f1f565a269 ceph: delete repeated words in fs/ceph/
Drop duplicated words "down" and "the" in fs/ceph/.

[ idryomov: merge into a single patch ]

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-08-03 11:05:27 +02:00
Xiubo Li
3b4168dd8b ceph: send client provided metric flags in client metadata
Send metric flags to the MDS, indicating what metrics the client
supports. Currently that consists of cap statistics, and read, write and
metadata latencies.

URL: https://tracker.ceph.com/issues/43435
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-08-03 11:05:27 +02:00
Xiubo Li
18f473b384 ceph: periodically send perf metrics to MDSes
This will send the caps/read/write/metadata metrics to any available MDS
once per second, which will be the same as the userland client.  It will
skip the MDS sessions which don't support the metric collection, as the
MDSs will close socket connections when they get an unknown type
message.

We can disable the metric sending via the disable_send_metrics module
parameter.

[ jlayton: fix up endianness bug in ceph_mdsc_send_metrics() ]

URL: https://tracker.ceph.com/issues/43215
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-08-03 11:05:26 +02:00
Xiubo Li
aaf5a47620 ceph: check the sesion state and return false in case it is closed
If the session is already in closed state, we should skip it.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-08-03 11:05:26 +02:00
Alexander A. Klimov
94f17c00d6 libceph: replace HTTP links with HTTPS ones
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.

Deterministic algorithm:
For each file:
  If not .svg:
    For each line:
      If doesn't contain `\bxmlns\b`:
        For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
	  If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`:
            If both the HTTP and HTTPS versions
            return 200 OK and serve the same content:
              Replace HTTP with HTTPS.

[ idryomov: Do the same for the CRUSH paper and replace
  ceph.newdream.net with ceph.io. ]

Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-08-03 11:05:26 +02:00
Xu Wang
c00e4522ad ceph: remove unnecessary cast in kfree()
Remove unnecassary casts in the argument to kfree.

Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-08-03 11:05:26 +02:00
Xiubo Li
d1d9655052 ceph: do not access the kiocb after aio requests
In aio case, if the completion comes very fast just before the
ceph_read_iter() returns to fs/aio.c, the kiocb will be freed in
the completion callback, then if ceph_read_iter() access again
we will potentially hit the use-after-free bug.

[ jlayton: initialize direct_lock early, and use it everywhere ]

URL: https://tracker.ceph.com/issues/45649
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-08-03 11:05:25 +02:00
Jeff Layton
585d72f33e ceph: clean up and optimize ceph_check_delayed_caps()
Make this loop look a bit more sane. Also optimize away the spinlock
release/reacquire if we can't get an inode reference.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-08-03 11:05:21 +02:00
Xiubo Li
fa99677342 ceph: fix potential mdsc use-after-free crash
Make sure the delayed work stopped before releasing the resources.

cancel_delayed_work_sync() will only guarantee that the work finishes
executing if the work is already in the ->worklist.  That means after
the cancel_delayed_work_sync() returns, it will leave the work requeued
if it was rearmed at the end. That can lead to a use after free once the
work struct is freed.

Fix it by flushing the delayed work instead of trying to cancel it, and
ensure that the work doesn't rearm if the mdsc is stopping.

URL: https://tracker.ceph.com/issues/46293
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-08-03 11:05:21 +02:00
Xiubo Li
b682c6d41b ceph: switch to WARN_ON_ONCE in encode_supported_features()
...and let the errnos bubble up to the callers.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-08-03 11:05:16 +02:00
Xiubo Li
4f1d756def ceph: add global total_caps to count the mdsc's total caps number
This will help to reduce using the global mdsc->mutex lock in many
places.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-08-03 11:05:15 +02:00
Xiubo Li
3e699bd865 ceph: add check_session_state() helper and make it global
And remove the unsed mdsc parameter to simplify the code.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-08-03 11:05:10 +02:00
Xiubo Li
e64f44a884 ceph: skip checking caps when session reconnecting and releasing reqs
It make no sense to check the caps when reconnecting to mds. And
for the async dirop caps, they will be put by its _cb() function,
so when releasing the requests, it will make no sense too.

URL: https://tracker.ceph.com/issues/45635
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:53 +02:00
Xiubo Li
ea8412b284 ceph: make sure mdsc->mutex is nested in s->s_mutex to fix dead lock
send_mds_reconnect takes the s_mutex while the mdsc->mutex is already
held. That inverts the locking order documented in mds_client.h. Drop
the mdsc->mutex, acquire the s_mutex and then reacquire the mdsc->mutex
to prevent a deadlock.

URL: https://tracker.ceph.com/issues/45609
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:53 +02:00
Luis Henriques
878dabb641 ceph: don't return -ESTALE if there's still an open file
Similarly to commit 03f219041f ("ceph: check i_nlink while converting
a file handle to dentry"), this fixes another corner case with
name_to_handle_at/open_by_handle_at.  The issue has been detected by
xfstest generic/467, when doing:

 - name_to_handle_at("/cephfs/myfile")
 - open("/cephfs/myfile")
 - unlink("/cephfs/myfile")
 - sync; sync;
 - drop caches
 - open_by_handle_at()

The call to open_by_handle_at should not fail because the file hasn't been
deleted yet (only unlinked) and we do have a valid handle to it.  -ESTALE
shall be returned only if i_nlink is 0 *and* i_count is 1.

This patch also makes sure we have LINK caps before checking i_nlink.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:53 +02:00
Luis Henriques
dffdcd7145 ceph: allow rename operation under different quota realms
Returning -EXDEV when trying to 'mv' files/directories from different
quota realms results in copy+unlink operations instead of the faster
CEPH_MDS_OP_RENAME.  This will occur even when there aren't any quotas
set in the destination directory, or if there's enough space left for
the new file(s).

This patch adds a new helper function to be called on rename operations
which will allow these operations if they can be executed.  This patch
mimics userland fuse client commit b8954e5734b3 ("client:
optimize rename operation under different quota root").

Since ceph_quota_is_same_realm() is now called only from this new
helper, make it static.

URL: https://tracker.ceph.com/issues/44791
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:53 +02:00
Luis Henriques
daa668fbac ceph: normalize 'delta' parameter usage in check_quota_exceeded
Function check_quota_exceeded() uses delta parameter only for the
QUOTA_CHECK_MAX_BYTES_OP operation.  Using this parameter also for
MAX_FILES will makes the code cleaner and will be required to support
cross-quota-tree renames.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:53 +02:00
Jeff Layton
829ad4db95 ceph: ceph_kick_flushing_caps needs the s_mutex
The mdsc->cap_dirty_lock is not held while walking the list in
ceph_kick_flushing_caps, which is not safe.

ceph_early_kick_flushing_caps does something similar, but the
s_mutex is held while it's called and I think that guards against
changes to the list.

Ensure we hold the s_mutex when calling ceph_kick_flushing_caps,
and add some clarifying comments.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:53 +02:00
Jeff Layton
d67c72e6cc ceph: request expedited service on session's last cap flush
When flushing a lot of caps to the MDS's at once (e.g. for syncfs),
we can end up waiting a substantial amount of time for MDS replies, due
to the fact that it may delay some of them so that it can batch them up
together in a single journal transaction. This can lead to stalls when
calling sync or syncfs.

What we'd really like to do is request expedited service on the _last_
cap we're flushing back to the server. If the CHECK_CAPS_FLUSH flag is
set on the request and the current inode was the last one on the
session->s_cap_dirty list, then mark the request with
CEPH_CLIENT_CAPS_SYNC.

Note that this heuristic is not perfect. New inodes can race onto the
list after we've started flushing, but it does seem to fix some common
use cases.

URL: https://tracker.ceph.com/issues/44744
Reported-by: Jan Fajerski <jfajerski@suse.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:52 +02:00
Jeff Layton
1cf03a68e7 ceph: convert mdsc->cap_dirty to a per-session list
This is a per-sb list now, but that makes it difficult to tell when
the cap is the last dirty one associated with the session. Switch
this to be a per-session list, but continue using the
mdsc->cap_dirty_lock to protect the lists.

This list is only ever walked in ceph_flush_dirty_caps, so change that
to walk the sessions array and then flush the caps for inodes on each
session's list.

If the auth cap ever changes while the inode has dirty caps, then
move the inode to the appropriate session for the new auth_cap. Also,
ensure that we never remove an auth cap while the inode is still on the
s_cap_dirty list.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:52 +02:00
Yan, Zheng
6f05b30ea0 ceph: reset i_requested_max_size if file write is not wanted
write can stuck at waiting for larger max_size in following sequence of
events:

- client opens a file and writes to position 'A' (larger than unit of
  max size increment)
- client closes the file handle and updates wanted caps (not wanting
  file write caps)
- client opens and truncates the file, writes to position 'A' again.

At the 1st event, client set inode's requested_max_size to 'A'. At the
2nd event, mds removes client's writable range, but client does not reset
requested_max_size. At the 3rd event, client does not request max size
because requested_max_size is already larger than 'A'.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:52 +02:00
Jeff Layton
88828190f0 ceph: throw a warning if we destroy session with mutex still locked
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:52 +02:00
Jeff Layton
dc3da0461c ceph: fix potential race in ceph_check_caps
Nothing ensures that session will still be valid by the time we
dereference the pointer. Take and put a reference.

In principle, we should always be able to get a reference here, but
throw a warning if that's ever not the case.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:52 +02:00
Jeff Layton
4fb5dda39c ceph: document what protects i_dirty_item and i_flushing_item
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:52 +02:00
Jeff Layton
7833323363 ceph: don't take i_ceph_lock in handle_cap_import
Just take it before calling it. This means we have to do a couple of
minor in-memory operations under the spinlock now, but those shouldn't
be an issue.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:52 +02:00
Jeff Layton
7391fba267 ceph: don't release i_ceph_lock in handle_cap_trunc
There's no reason to do this here. Just have the caller handle it.
Also, add a lockdep assertion.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:52 +02:00
Jeff Layton
d7dbfb4f2b ceph: add comments for handle_cap_flush_ack logic
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:52 +02:00
Jeff Layton
681ac63488 ceph: split up __finish_cap_flush
This function takes a mdsc argument or ci argument, but if both are
passed in, it ignores the ci arg. Fortunately, nothing does that, but
there's no good reason to have the same function handle both cases.

Also, get rid of some branches and just use |= to set the wake_* vals.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:52 +02:00
Jeff Layton
0a454bdd50 ceph: reorganize __send_cap for less spinlock abuse
Get rid of the __releases annotation by breaking it up into two
functions: __prep_cap which is done under the spinlock and __send_cap
that is done outside it. Add new fields to cap_msg_args for the wake
boolean and old_xattr_buf pointer.

Nothing checks the return value from __send_cap, so make it void
return.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:51 +02:00
Xiubo Li
70c948206f ceph: add metadata perf metric support
Add a new "r_ended" field to struct ceph_mds_request and use that to
maintain the average latency of MDS requests.

URL: https://tracker.ceph.com/issues/43215
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:51 +02:00
Xiubo Li
97e27aaa9a ceph: add read/write latency metric support
Calculate the latency for OSD read requests. Add a new r_end_stamp
field to struct ceph_osd_request that will hold the time of that
the reply was received. Use that to calculate the RTT for each call,
and divide the sum of those by number of calls to get averate RTT.

Keep a tally of RTT for OSD writes and number of calls to track average
latency of OSD writes.

URL: https://tracker.ceph.com/issues/43215
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:51 +02:00
Xiubo Li
1af16d547f ceph: add caps perf metric for each superblock
Count hits and misses in the caps cache. If the client has all of
the necessary caps when a task needs references, then it's counted
as a hit. Any other situation is a miss.

URL: https://tracker.ceph.com/issues/43215
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:51 +02:00
Xiubo Li
f9009efac4 ceph: add dentry lease metric support
For dentry leases, only count the hit/miss info triggered from the vfs
calls. For the cases like request reply handling and ceph_trim_dentries,
ignore them.

For now, these are only viewable using debugfs. Future patches will
allow the client to send the stats to the MDS.

The output looks like:

item          total           miss            hit
-------------------------------------------------
d_lease       11              7               141

URL: https://tracker.ceph.com/issues/43215
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:51 +02:00
Jeff Layton
fb33c114d3 ceph: flush release queue when handling caps for unknown inode
It's possible for the VFS to completely forget about an inode, but for
it to still be sitting on the cap release queue. If the MDS sends the
client a cap message for such an inode, it just ignores it today, which
can lead to a stall of up to 5s until the cap release queue is flushed.

If we get a cap message for an inode that can't be located, then go
ahead and flush the cap release queue.

Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/45532
Fixes: 1e9c2eb681 ("ceph: delete stale dentry when last reference is dropped")
Reported-and-Tested-by: Andrej Filipčič <andrej.filipcic@ijs.si>
Suggested-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-05-27 13:03:57 +02:00
Linus Torvalds
0a85ed6e7f block-5.7-2020-05-09
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl63WVAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpkXWD/9qJgqQpPkigCCwwPHZ+phthw6gHeAgBxPH
 Cw6P9QB4QCdacZjQA6QH3zdxaDsCCitQRioWPgxngs1326TKYNzBi7U3eTEwiK12
 cnRybLnkzei4yzYVUSJk637oOoQh3CiJLvYcJBppGFi7crpbvlQv68M2hu05vhwL
 R/91H62X/5UaUlc1cJV63OBk8euWzF6XNbCQQrR4ayDvz+BsV5Fs72vYa1gx7qIt
 as/67oTT6y4U4pd74nT4OGkxDIXbXfn2eTbh5sMNc4ilBkqMyNbf8aOHdWqXZIBd
 18RKpNl6h/fiDMJ0jsGliReONLjfRBcJla68Kn1AFONMcyxcXidjptOwLOt2fYWf
 YMguCVMhfgxVBslzLWoQ9AWSiNVh36ycORWlCOrnRaOaQCb9OaLZ2fwibfZ0JsMd
 0259Z5vA7MIUoobCc5akXOYHbpByA9FSYkKudgTYLpdjkn05kxQyA12GgJjW3sVw
 ZRjoUuDuZDDUct6JcLWdrlONT8st05g+qf6PCoD+Jac8HtbpqHfKJJUtYecUat75
 4hGKhuvTzpuVY0wNHo3sgqKfsejQODTN6UhejNI11Zs/nx6O0ze/qoDuWZHncnKl
 158le+K5rNS8SUNbDBTMWp3OX4SJm/Gsf30fOWkkt6z1iaEfKc5sCxBHvSOeBEvH
 M9pzy56Vtw==
 =73nU
 -----END PGP SIGNATURE-----

Merge tag 'block-5.7-2020-05-09' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - a small series fixing a use-after-free of bdi name (Christoph,Yufen)

 - NVMe fix for a regression with the smaller CQ update (Alexey)

 - NVMe fix for a hang at namespace scanning error recovery (Sagi)

 - fix race with blk-iocost iocg->abs_vdebt updates (Tejun)

* tag 'block-5.7-2020-05-09' of git://git.kernel.dk/linux-block:
  nvme: fix possible hang when ns scanning fails during error recovery
  nvme-pci: fix "slimmer CQ head update"
  bdi: add a ->dev_name field to struct backing_dev_info
  bdi: use bdi_dev_name() to get device name
  bdi: move bdi_dev_name out of line
  vboxsf: don't use the source name in the bdi name
  iocost: protect iocg->abs_vdebt with iocg->waitq.lock
2020-05-10 11:16:07 -07:00
Yufen Yu
d51cfc53ad bdi: use bdi_dev_name() to get device name
Use the common interface bdi_dev_name() to get device name.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>

Add missing <linux/backing-dev.h> include BFQ

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:07:39 -06:00
Luis Henriques
12ae44a40a ceph: demote quotarealm lookup warning to a debug message
A misconfigured cephx can easily result in having the kernel client
flooding the logs with:

  ceph: Can't lookup inode 1 (err: -13)

Change this message to debug level.

Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/44546
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-05-08 18:44:40 +02:00
Wu Bo
4d8e28ff31 ceph: fix double unlock in handle_cap_export()
If the ceph_mdsc_open_export_target_session() return fails, it will
do a "goto retry", but the session mutex has already been unlocked.
Re-lock the mutex in that case to ensure that we don't unlock it
twice.

Signed-off-by: Wu Bo <wubo40@huawei.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-05-04 19:14:23 +02:00
Wu Bo
7d8976afad ceph: fix special error code in ceph_try_get_caps()
There are 3 speical error codes: -EAGAIN/-EFBIG/-ESTALE.
After calling try_get_cap_refs, ceph_try_get_caps test for the
-EAGAIN twice. Ensure that it tests for -ESTALE instead.

Signed-off-by: Wu Bo <wubo40@huawei.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-05-04 19:14:23 +02:00
Jeff Layton
0fa8263367 ceph: fix endianness bug when handling MDS session feature bits
Eduard reported a problem mounting cephfs on s390 arch. The feature
mask sent by the MDS is little-endian, so we need to convert it
before storing and testing against it.

Cc: stable@vger.kernel.org
Reported-and-Tested-by: Eduard Shishkin <edward6@linux.ibm.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-05-04 19:14:23 +02:00
Jeff Layton
2a575f138d ceph: fix potential bad pointer deref in async dirops cb's
The new async dirops callback routines can pass ERR_PTR values to
ceph_mdsc_free_path, which could cause an oops. Make ceph_mdsc_free_path
ignore ERR_PTR values. Also, ensure that the pr_warn messages look sane
even if ceph_mdsc_build_path fails.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-04-13 19:33:47 +02:00
Luis Henriques
ef9157259f ceph: fix snapshot directory timestamps
The .snap directory timestamps are kept at 0 (1970-01-01 00:00), which
isn't consistent with what the fuse client does.  This patch makes the
behaviour consistent, by setting these timestamps (atime, btime, ctime,
mtime) to those of the parent directory.

Cc: Marc Roos <M.Roos@f1-outsourcing.eu>
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:43 +02:00
Yan, Zheng
9bccb76574 ceph: wait for async creating inode before requesting new max size
ceph_check_caps() can't request new max size for async creating inode.
This may make ceph_get_caps() loop busily until getting reply of the
async create. Also, wait for async creating reply before calling
ceph_renew_caps().

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:43 +02:00
Yan, Zheng
0aa971b6fd ceph: don't skip updating wanted caps when cap is stale
1. try_get_cap_refs() fails to get caps and finds that mds_wanted
   does not include what it wants. It returns -ESTALE.
2. ceph_get_caps() calls ceph_renew_caps(). ceph_renew_caps() finds
   that inode has cap, so it calls ceph_check_caps().
3. ceph_check_caps() finds that issued caps (without checking if it's
   stale) already includes caps wanted by open file, so it skips
   updating wanted caps.

Above events can cause an infinite loop inside ceph_get_caps().

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:43 +02:00
Yan, Zheng
42d70f8e31 ceph: request new max size only when there is auth cap
When there is no auth cap, check_max_size() can't do anything and may
cause an infinite loop inside ceph_get_caps().

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:43 +02:00
Yan, Zheng
546d402085 ceph: cleanup return error of try_get_cap_refs()
Returns 0 if caps were not able to be acquired (yet), 1 if cap
acquisition succeeded, or a negative error code. There are 3 special
error codes:

-EAGAIN: need to sleep but non-blocking is specified
-EFBIG:  ask caller to call check_max_size() and try again.
-ESTALE: ask caller to call ceph_renew_caps() and try again.

[ jlayton: add WARN_ON_ONCE check for -EAGAIN ]

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:42 +02:00
Qiujun Huang
c6d5029603 ceph: return ceph_mdsc_do_request() errors from __get_parent()
Return the error returned by ceph_mdsc_do_request(). Otherwise,
r_target_inode ends up being NULL this ends up returning ENOENT
regardless of the error.

Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:42 +02:00
Yan, Zheng
bf73c62e7f ceph: check all mds' caps after page writeback
If an inode has caps from multiple mds's, the following can happen:

- non-auth mds revokes Fsc. Fcb is used, so page writeback is queued.
- when writeback finishes, ceph_check_caps() is called with auth only
  flag. ceph_check_caps() invalidates pagecache, but skips checking any
  non-auth caps.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:42 +02:00
Yan, Zheng
11ba6b9cee ceph: update i_requested_max_size only when sending cap msg to auth mds
Non-auth mds can't do anything to 'update max' cap message.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:42 +02:00
Yan, Zheng
135e671e54 ceph: simplify calling of ceph_get_fmode()
Originally, calling ceph_get_fmode() for open files is by thread that
handles request reply. There is a small window between updating caps and
and waking the request initiator. We need to prevent ceph_check_caps()
from releasing wanted caps in the window.

Previous patches made fill_inode() call __ceph_touch_fmode() for open file
requests. This prevented ceph_check_caps() from releasing wanted caps for
'caps_wanted_delay_min' seconds, enough for request initiator to get
woken up and call ceph_get_fmode().

This allows us to now call ceph_get_fmode() in ceph_open() instead.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:42 +02:00
Yan, Zheng
a0d93e327f ceph: remove delay check logic from ceph_check_caps()
__ceph_caps_file_wanted() already checks 'caps_wanted_delay_min' and
'caps_wanted_delay_max'. There is no need to duplicate the logic in
ceph_check_caps() and __send_cap()

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:42 +02:00
Yan, Zheng
719a2514e9 ceph: consider inode's last read/write when calculating wanted caps
Add i_last_rd and i_last_wr to ceph_inode_info. These fields are
used to track the last time the client acquired read/write caps for
the inode.

If there is no read/write on an inode for 'caps_wanted_delay_max'
seconds, __ceph_caps_file_wanted() does not request caps for read/write
even there are open files.

Call __ceph_touch_fmode() for dir operations. __ceph_caps_file_wanted()
calculates dir's wanted caps according to last dir read/modification. If
there is recent dir read, dir inode wants CEPH_CAP_ANY_SHARED caps. If
there is recent dir modification, also wants CEPH_CAP_FILE_EXCL.

Readdir is a special case. Dir inode wants CEPH_CAP_FILE_EXCL after
readdir, as with that, modifications do not need to release
CEPH_CAP_FILE_SHARED or invalidate all dentry leases issued by readdir.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:42 +02:00
Yan, Zheng
c0e385b106 ceph: always renew caps if mds_wanted is insufficient
Original code only renews caps for inodes with CEPH_I_CAP_DROPPED flag,
which indicates that mds has closed the session and caps were dropped.
Remove this flag in preparation for not requesting caps for idle open
files.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:42 +02:00
Yan, Zheng
3313f66a57 ceph: update dentry lease for async create
Otherwise ceph_d_delete() may return 1 for the dentry, which makes
dput() prune the dentry and clear parent dir's complete flag.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:42 +02:00
Jeff Layton
9a8d03ca2e ceph: attempt to do async create when possible
With the Octopus release, the MDS will hand out directory create caps.

If we have Fxc caps on the directory, and complete directory information
or a known negative dentry, then we can return without waiting on the
reply, allowing the open() call to return very quickly to userland.

We use the normal ceph_fill_inode() routine to fill in the inode, so we
have to gin up some reply inode information with what we'd expect the
newly-created inode to have. The client assumes that it has a full set
of caps on the new inode, and that the MDS will revoke them when there
is conflicting access.

This functionality is gated on the wsync/nowsync mount options.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:42 +02:00
Jeff Layton
785892fe88 ceph: cache layout in parent dir on first sync create
If a create is done, then typically we'll end up writing to the file
soon afterward. We don't want to wait for the reply before doing that
when doing an async create, so that means we need the layout for the
new file before we've gotten the response from the MDS.

All files created in a directory will initially inherit the same layout,
so copy off the requisite info from the first synchronous create in the
directory, and save it in a new i_cached_layout field. Zero out the
layout when we lose Dc caps in the dir.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:42 +02:00
Jeff Layton
6deb8008a8 ceph: add new MDS req field to hold delegated inode number
Add new request field to hold the delegated inode number. Encode that
into the message when it's set.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:42 +02:00
Jeff Layton
d484648787 ceph: decode interval_sets for delegated inos
Starting in Octopus, the MDS will hand out caps that allow the client
to do asynchronous file creates under certain conditions. As part of
that, the MDS will delegate ranges of inode numbers to the client.

Add the infrastructure to decode these ranges, and stuff them into an
xarray for later consumption by the async creation code.

Because the xarray code currently only handles unsigned long indexes,
and those are 32-bits on 32-bit arches, we only enable the decoding when
running on a 64-bit arch.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:42 +02:00
Jeff Layton
966c716018 ceph: make ceph_fill_inode non-static
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:42 +02:00
Jeff Layton
2ccb45462a ceph: perform asynchronous unlink if we have sufficient caps
The MDS is getting a new lock-caching facility that will allow it
to cache the necessary locks to allow asynchronous directory operations.
Since the CEPH_CAP_FILE_* caps are currently unused on directories,
we can repurpose those bits for this purpose.

When performing an unlink, if we have Fx on the parent directory,
and CEPH_CAP_DIR_UNLINK (aka Fr), and we know that the dentry being
removed is the primary link, then then we can fire off an unlink
request immediately and don't need to wait on reply before returning.

In that situation, just fix up the dcache and link count and return
immediately after issuing the call to the MDS. This does mean that we
need to hold an extra reference to the inode being unlinked, and extra
references to the caps to avoid races. Those references are put and
error handling is done in the r_callback routine.

If the operation ends up failing, then set a writeback error on the
directory inode, and the inode itself that can be fetched later by
an fsync on the dir.

The behavior of dir caps is slightly different from caps on normal
files. Because these are just considered an optimization, if the
session is reconnected, we will not automatically reclaim them. They
are instead considered lost until we do another synchronous op in the
parent directory.

Async dirops are enabled via the "nowsync" mount option, which is
patterned after the xfs "wsync" mount option. For now, the default
is "wsync", but eventually we may flip that.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:42 +02:00
Yan, Zheng
173e70e8ac ceph: don't take refs to want mask unless we have all bits
If we don't have all of the cap bits for the want mask in
try_get_cap_refs, then just take refs on the need bits.

Signed-off-by: "Yan, Zheng" <ukernel@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:41 +02:00
Jeff Layton
a25949b990 ceph: cap tracking for async directory operations
Track and correctly handle directory caps for asynchronous operations.
Add aliases for Frc caps that we now designate at Dcu caps (when dealing
with directories).

Unlike file caps, we don't reclaim these when the session goes away, and
instead preemptively release them. In-flight async dirops are instead
handled during reconnect phase. The client needs to re-do a synchronous
operation in order to re-get directory caps.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:41 +02:00
Jeff Layton
40dcf75e82 ceph: make __take_cap_refs non-static
Rename it to ceph_take_cap_refs and make it available to other files.
Also replace a comment with a lockdep assertion.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:41 +02:00
Jeff Layton
891f3f5a6a ceph: add infrastructure for waiting for async create to complete
When we issue an async create, we must ensure that any later on-the-wire
requests involving it wait for the create reply.

Expand i_ceph_flags to be an unsigned long, and add a new bit that
MDS requests can wait on. If the bit is set in the inode when sending
caps, then don't send it and just return that it has been delayed.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:41 +02:00
Jeff Layton
f5e17aed3a ceph: track primary dentry link
Newer versions of the MDS will flag a dentry as "primary". In later
patches, we'll need to consult this info, so track it in di->flags.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:41 +02:00
Jeff Layton
3bb48b4142 ceph: add flag to designate that a request is asynchronous
...and ensure that such requests are never queued. The MDS has need to
know that a request is asynchronous so add flags and proper
infrastructure for that.

Also, delegated inode numbers and directory caps are associated with the
session, so ensure that async requests are always transmitted on the
first attempt and are never queued to wait for session reestablishment.

If it does end up looking like we'll need to queue the request, then
have it return -EJUKEBOX so the caller can reattempt with a synchronous
request.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:41 +02:00
Jeff Layton
c7e4f85ce9 ceph: more caps.c lockdep assertions
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:41 +02:00
Jeff Layton
e8a4d26771 ceph: clean up kick_flushing_inode_caps()
The last thing that this function does is release i_ceph_lock, so
have the caller do that instead. Add a lockdep assertion to
ensure that the function is always called with i_ceph_lock held.
Change the prototype to take a ceph_inode_info pointer and drop
the separate mdsc argument as we can get that from the session.

While at it, make it non-static.  We'll need this to kick any
flushing caps once the create reply comes in.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:41 +02:00
Xiubo Li
8ccf7fcce1 ceph: return ETIMEDOUT errno to userland when request timed out
req->r_timeout is only used during mounting, so this error will
be more accurate.

URL: https://tracker.ceph.com/issues/44215
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:41 +02:00
Luis Henriques
1b0c3b9f91 ceph: re-org copy_file_range and fix some error paths
This patch re-organizes copy_file_range, trying to fix a few issues in the
error handling.  Here's the summary:

- Abort copy if initial do_splice_direct() returns fewer bytes than
  requested.

- Move the 'size' initialization (with i_size_read()) further down in the
  code, after the initial call to do_splice_direct().  This avoids issues
  with a possibly stale value if a manual copy is done.

- Move the object copy loop into a separate function.  This makes it
  easier to handle errors (e.g, dirtying caps and updating the MDS
  metadata if only some objects have been copied before an error has
  occurred).

- Added calls to ceph_oloc_destroy() to avoid leaking memory with src_oloc
  and dst_oloc

- After the object copy loop, the new file size to be reported to the MDS
  (if there's file size change) is now the actual file size, and not the
  size after an eventual extra manual copy.

- Added a few dout() to show the number of bytes copied in the two manual
  copies and in the object copy loop.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:41 +02:00
Jeff Layton
058daab79d ceph: move to a dedicated slabcache for mds requests
On my machine (x86_64) this struct is 952 bytes, which gets rounded up
to 1024 by kmalloc. Move this to a dedicated slabcache, so we can
allocate them without the extra 72 bytes of overhead per.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:41 +02:00
Jeff Layton
c36d641493 ceph: reorganize fields in ceph_mds_request
This shrinks the struct size by 16 bytes.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:41 +02:00
Andreas Gruenbacher
cb03c14390 ceph: switch to page_mkwrite_check_truncate in ceph_page_mkwrite
Use the "page has been truncated" logic in page_mkwrite_check_truncate
instead of reimplementing it here.  Other than with the existing code,
fail with -EFAULT / VM_FAULT_NOPAGE when page_offset(page) == size here
as well, as should be expected.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:41 +02:00
Gustavo A. R. Silva
f682dc713c ceph: replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:40 +02:00
Yan, Zheng
bbb480ab05 ceph: check if file lock exists before sending unlock request
When a process exits, kernel closes its files. locks_remove_file()
is called to remove file locks on these files. locks_remove_file()
tries unlocking files even there is no file lock.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:40 +02:00
Xiubo Li
5107d7d505 ceph: move ceph_osdc_{read,write}pages to ceph.ko
Since these helpers are only used by ceph.ko, move them there and
rename them with _sync_ qualifiers.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:40 +02:00
Jeff Layton
70837470b4 ceph: don't ClearPageChecked in ceph_invalidatepage()
CephFS doesn't set this bit to begin with, so there should be no need
to clear it.

Reported-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:40 +02:00
Ilya Dryomov
072eaf3c0f libceph: drop CEPH_DEFINE_SHOW_FUNC
Although CEPH_DEFINE_SHOW_FUNC is much older, it now duplicates
DEFINE_SHOW_ATTRIBUTE from linux/seq_file.h.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2020-03-30 12:42:40 +02:00
Yan, Zheng
525d15e8e5 ceph: check inode type for CEPH_CAP_FILE_{CACHE,RD,REXTEND,LAZYIO}
These bits will have new meaning for directory inodes.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:40 +02:00
Jeff Layton
f85122afeb ceph: add refcounting for Fx caps
In future patches we'll be taking and relying on Fx caps. Add proper
refcounting for them.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:40 +02:00
Jeff Layton
3db0a2fc56 ceph: register MDS request with dir inode from the start
When the unsafe reply to a request comes in, the request is put on the
r_unsafe_dir inode's list. In future patches, we're going to need to
wait on requests that may not have gotten an unsafe reply yet.

Change __register_request to put the entry on the dir inode's list when
the pointer is set in the request, and don't check the
CEPH_MDS_R_GOT_UNSAFE flag when unregistering it.

The only place that uses this list today is fsync codepath, and with
the coming changes, we'll want to wait on all operations whether it has
gotten an unsafe reply or not.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:39 +02:00
Luis Henriques
c8d6ee0144 ceph: fix memory leak in ceph_cleanup_snapid_map()
kmemleak reports the following memory leak:

unreferenced object 0xffff88821feac8a0 (size 96):
  comm "kworker/1:0", pid 17, jiffies 4294896362 (age 20.512s)
  hex dump (first 32 bytes):
    a0 c8 ea 1f 82 88 ff ff 00 c9 ea 1f 82 88 ff ff  ................
    00 00 00 00 00 00 00 00 00 01 00 00 00 00 ad de  ................
  backtrace:
    [<00000000b3ea77fb>] ceph_get_snapid_map+0x75/0x2a0
    [<00000000d4060942>] fill_inode+0xb26/0x1010
    [<0000000049da6206>] ceph_readdir_prepopulate+0x389/0xc40
    [<00000000e2fe2549>] dispatch+0x11ab/0x1521
    [<000000007700b894>] ceph_con_workfn+0xf3d/0x3240
    [<0000000039138a41>] process_one_work+0x24d/0x590
    [<00000000eb751f34>] worker_thread+0x4a/0x3d0
    [<000000007e8f0d42>] kthread+0xfb/0x130
    [<00000000d49bd1fa>] ret_from_fork+0x3a/0x50

A kfree is missing while looping the 'to_free' list of ceph_snapid_map
objects.

Cc: stable@vger.kernel.org
Fixes: 75c9627efb ("ceph: map snapid to anonymous bdev ID")
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-23 13:07:08 +01:00
Ilya Dryomov
7614209736 ceph: check POOL_FLAG_FULL/NEARFULL in addition to OSDMAP_FULL/NEARFULL
CEPH_OSDMAP_FULL/NEARFULL aren't set since mimic, so we need to consult
per-pool flags as well.  Unfortunately the backwards compatibility here
is lacking:

- the change that deprecated OSDMAP_FULL/NEARFULL went into mimic, but
  was guarded by require_osd_release >= RELEASE_LUMINOUS
- it was subsequently backported to luminous in v12.2.2, but that makes
  no difference to clients that only check OSDMAP_FULL/NEARFULL because
  require_osd_release is not client-facing -- it is for OSDs

Since all kernels are affected, the best we can do here is just start
checking both map flags and pool flags and send that to stable.

These checks are best effort, so take osdc->lock and look up pool flags
just once.  Remove the FIXME, since filesystem quotas are checked above
and RADOS quotas are reflected in POOL_FLAG_FULL: when the pool reaches
its quota, both POOL_FLAG_FULL and POOL_FLAG_FULL_QUOTA are set.

Cc: stable@vger.kernel.org
Reported-by: Yanhu Cao <gmayyyha@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Sage Weil <sage@redhat.com>
2020-03-23 13:07:08 +01:00
Xiubo Li
3b20bc2fe4 ceph: noacl mount option is effectively ignored
For the old mount API, the module parameters parseing function will
be called in ceph_mount() and also just after the default posix acl
flag set, so we can control to enable/disable it via the mount option.

But for the new mount API, it will call the module parameters
parseing function before ceph_get_tree(), so the posix acl will always
be enabled.

Fixes: 82995cc6c5 ("libceph, rbd, ceph: convert to use the new mount API")
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-02-11 17:04:40 +01:00
Ilya Dryomov
b27a939e83 ceph: canonicalize server path in place
syzbot reported that 4fbc0c711b ("ceph: remove the extra slashes in
the server path") had caused a regression where an allocation could be
done under a spinlock -- compare_mount_options() is called by sget_fc()
with sb_lock held.

We don't really need the supplied server path, so canonicalize it
in place and compare it directly.  To make this work, the leading
slash is kept around and the logic in ceph_real_mount() to skip it
is restored.  CEPH_MSG_CLIENT_SESSION now reports the same (i.e.
canonicalized) path, with the leading slash of course.

Fixes: 4fbc0c711b ("ceph: remove the extra slashes in the server path")
Reported-by: syzbot+98704a51af8e3d9425a9@syzkaller.appspotmail.com
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2020-02-11 17:04:40 +01:00
Xiubo Li
8e4473bb50 ceph: do not execute direct write in parallel if O_APPEND is specified
In O_APPEND & O_DIRECT mode, the data from different writers will
be possibly overlapping each other since they take the shared lock.

For example, both Writer1 and Writer2 are in O_APPEND and O_DIRECT
mode:

          Writer1                         Writer2

     shared_lock()                   shared_lock()
     getattr(CAP_SIZE)               getattr(CAP_SIZE)
     iocb->ki_pos = EOF              iocb->ki_pos = EOF
     write(data1)
                                     write(data2)
     shared_unlock()                 shared_unlock()

The data2 will overlap the data1 from the same file offset, the
old EOF.

Switch to exclusive lock instead when O_APPEND is specified.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-02-11 17:04:04 +01:00
Linus Torvalds
c9d35ee049 Merge branch 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs file system parameter updates from Al Viro:
 "Saner fs_parser.c guts and data structures. The system-wide registry
  of syntax types (string/enum/int32/oct32/.../etc.) is gone and so is
  the horror switch() in fs_parse() that would have to grow another case
  every time something got added to that system-wide registry.

  New syntax types can be added by filesystems easily now, and their
  namespace is that of functions - not of system-wide enum members. IOW,
  they can be shared or kept private and if some turn out to be widely
  useful, we can make them common library helpers, etc., without having
  to do anything whatsoever to fs_parse() itself.

  And we already get that kind of requests - the thing that finally
  pushed me into doing that was "oh, and let's add one for timeouts -
  things like 15s or 2h". If some filesystem really wants that, let them
  do it. Without somebody having to play gatekeeper for the variants
  blessed by direct support in fs_parse(), TYVM.

  Quite a bit of boilerplate is gone. And IMO the data structures make a
  lot more sense now. -200LoC, while we are at it"

* 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (25 commits)
  tmpfs: switch to use of invalfc()
  cgroup1: switch to use of errorfc() et.al.
  procfs: switch to use of invalfc()
  hugetlbfs: switch to use of invalfc()
  cramfs: switch to use of errofc() et.al.
  gfs2: switch to use of errorfc() et.al.
  fuse: switch to use errorfc() et.al.
  ceph: use errorfc() and friends instead of spelling the prefix out
  prefix-handling analogues of errorf() and friends
  turn fs_param_is_... into functions
  fs_parse: handle optional arguments sanely
  fs_parse: fold fs_parameter_desc/fs_parameter_spec
  fs_parser: remove fs_parameter_description name field
  add prefix to fs_context->log
  ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log
  new primitive: __fs_parse()
  switch rbd and libceph to p_log-based primitives
  struct p_log, variants of warnf() et.al. taking that one instead
  teach logfc() to handle prefices, give it saner calling conventions
  get rid of cg_invalf()
  ...
2020-02-08 13:26:41 -08:00
Al Viro
d53d0f7461 ceph: use errorfc() and friends instead of spelling the prefix out
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:39 -05:00
Al Viro
48ce73b1be fs_parse: handle optional arguments sanely
Don't bother with "mixed" options that would allow both the
form with and without argument (i.e. both -o foo and -o foo=bar).
Rather than trying to shove both into a single fs_parameter_spec,
allow having with-argument and no-argument specs with the same
name and teach fs_parse to handle that.

There are very few options of that sort, and they are actually
easier to handle that way - callers end up with less postprocessing.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:37 -05:00
Al Viro
d7167b1499 fs_parse: fold fs_parameter_desc/fs_parameter_spec
The former contains nothing but a pointer to an array of the latter...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:37 -05:00
Eric Sandeen
96cafb9ccb fs_parser: remove fs_parameter_description name field
Unused now.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:36 -05:00
Al Viro
cc3c0b533a add prefix to fs_context->log
... turning it into struct p_log embedded into fs_context.  Initialize
the prefix with fs_type->name, turning fs_parse() into a trivial
inline wrapper for __fs_parse().

This makes fs_parameter_description->name completely unused.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:35 -05:00
Al Viro
c80c98f0dc ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log
... and now errorf() et.al. are never called with NULL fs_context,
so we can get rid of conditional in those.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:34 -05:00
Al Viro
5eede62529 fold struct fs_parameter_enum into struct constant_table
no real difference now

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 00:12:50 -05:00
Al Viro
2710c957a8 fs_parse: get rid of ->enums
Don't do a single array; attach them to fsparam_enum() entry
instead.  And don't bother trying to embed the names into those -
it actually loses memory, with no real speedup worth mentioning.

Simplifies validation as well.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 00:12:50 -05:00
Linus Torvalds
4c46bef2e9 We have:
- a set of patches that fixes various corner cases in mount and umount
   code (Xiubo Li).  This has to do with choosing an MDS, distinguishing
   between laggy and down MDSes and parsing the server path.
 
 - inode initialization fixes (Jeff Layton).  The one included here
   mostly concerns things like open_by_handle() and there is another
   one that will come through Al.
 
 - copy_file_range() now uses the new copy-from2 op (Luis Henriques).
   The existing copy-from op turned out to be infeasible for generic
   filesystem use; we disable the copy offload if OSDs don't support
   copy-from2.
 
 - a patch to link "rbd" and "block" devices together in sysfs (Hannes
   Reinecke)
 
 And a smattering of cleanups from Xiubo, Jeff and Chengguang.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl47PUcTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi6LoCACmVli5N6bgnBE4sTixi/jz6aCCbk32
 ZPlKiSesHnOGkY6KXHJT58JYy0paITBRik5ypdz06J8aCOtWyPLbn3uCemF9CYn2
 g6dId2Lf5vGFrgSm4YSiqp9a86IZmYSDG41LbJD/IJWFDWdMWqNPMDqji6yaIO5O
 NJI5N0tk+VFXdV+JyjV9X/FnP1r1D2ReZzz21ZiqTJXSmE8YIkioLjkq36QTMMG7
 Gm5qdlc1x2r4qfzA1g+OiWgRQCUMgkuYerFzus4mVbW4hrphsavH2DArbOwFmsXF
 46hOq+1uGVVyZILLJfKNiktf1GExBF0icbSREJtmjUHbQvNR8BH0C+fV
 =vvIc
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.6-rc1' of https://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:

 - a set of patches that fixes various corner cases in mount and umount
   code (Xiubo Li). This has to do with choosing an MDS, distinguishing
   between laggy and down MDSes and parsing the server path.

 - inode initialization fixes (Jeff Layton). The one included here
   mostly concerns things like open_by_handle() and there is another one
   that will come through Al.

 - copy_file_range() now uses the new copy-from2 op (Luis Henriques).
   The existing copy-from op turned out to be infeasible for generic
   filesystem use; we disable the copy offload if OSDs don't support
   copy-from2.

 - a patch to link "rbd" and "block" devices together in sysfs (Hannes
   Reinecke)

... and a smattering of cleanups from Xiubo, Jeff and Chengguang.

* tag 'ceph-for-5.6-rc1' of https://github.com/ceph/ceph-client: (25 commits)
  rbd: set the 'device' link in sysfs
  ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c
  ceph: print name of xattr in __ceph_{get,set}xattr() douts
  ceph: print r_direct_hash in hex in __choose_mds() dout
  ceph: use copy-from2 op in copy_file_range
  ceph: close holes in structs ceph_mds_session and ceph_mds_request
  rbd: work around -Wuninitialized warning
  ceph: allocate the correct amount of extra bytes for the session features
  ceph: rename get_session and switch to use ceph_get_mds_session
  ceph: remove the extra slashes in the server path
  ceph: add possible_max_rank and make the code more readable
  ceph: print dentry offset in hex and fix xattr_version type
  ceph: only touch the caps which have the subset mask requested
  ceph: don't clear I_NEW until inode metadata is fully populated
  ceph: retry the same mds later after the new session is opened
  ceph: check availability of mds cluster on mount after wait timeout
  ceph: keep the session state until it is released
  ceph: add __send_request helper
  ceph: ensure we have a new cap before continuing in fill_inode
  ceph: drop unused ttl_from parameter from fill_inode
  ...
2020-02-06 12:21:01 +00:00
Linus Torvalds
bddea11b1b Merge branch 'imm.timestamp' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs timestamp updates from Al Viro:
 "More 64bit timestamp work"

* 'imm.timestamp' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  kernfs: don't bother with timestamp truncation
  fs: Do not overload update_time
  fs: Delete timespec64_trunc()
  fs: ubifs: Eliminate timespec64_trunc() usage
  fs: ceph: Delete timespec64_trunc() usage
  fs: cifs: Delete usage of timespec64_trunc
  fs: fat: Eliminate timespec64_trunc() usage
  utimes: Clamp the timestamps in notify_change()
2020-02-05 05:02:42 +00:00
Jeff Layton
24604f7e2b ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c
All of these functions are only called from CephFS, so move them into
ceph.ko, and drop the exports.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27 16:53:40 +01:00
Jeff Layton
d36e0b620a ceph: print name of xattr in __ceph_{get,set}xattr() douts
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27 16:53:40 +01:00
Xiubo Li
3c802092da ceph: print r_direct_hash in hex in __choose_mds() dout
It's hard to read, especially when it is:

  ceph:  __choose_mds 00000000b7bc9c15 is_hash=1 (-271041095) mode 0

At the same time, switch to __func__ to get rid of the checkpatch
warning.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27 16:53:40 +01:00
Luis Henriques
78beb0ff2f ceph: use copy-from2 op in copy_file_range
Instead of using the copy-from operation, switch copy_file_range to the
new copy-from2 operation, which allows to send the truncate_seq and
truncate_size parameters.

If an OSD does not support the copy-from2 operation it will return
-EOPNOTSUPP.  In that case, the kernel client will stop trying to do
remote object copies for this fs client and will always use the generic
VFS copy_file_range.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27 16:53:40 +01:00
Jeff Layton
045100cd79 ceph: close holes in structs ceph_mds_session and ceph_mds_request
Move s_ref up to plug a 4 byte hole, which plugs another.
Move r_kref to shave 8 bytes off per request on x86_64.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27 16:53:40 +01:00
Xiubo Li
9ba1e22453 ceph: allocate the correct amount of extra bytes for the session features
The total bytes may potentially be larger than 8.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27 16:53:40 +01:00
Xiubo Li
5b3248c677 ceph: rename get_session and switch to use ceph_get_mds_session
Just in case the session's refcount reach 0 and is releasing, and
if we get the session without checking it, we may encounter kernel
crash.

Rename get_session to ceph_get_mds_session and make it global.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27 16:53:40 +01:00
Xiubo Li
4fbc0c711b ceph: remove the extra slashes in the server path
It's possible to pass the mount helper a server path that has more
than one contiguous slash character. For example:

  $ mount -t ceph 192.168.195.165:40176:/// /mnt/cephfs/

In the MDS server side the extra slashes of the server path will be
treated as snap dir, and then we can get the following debug logs:

  ceph:  mount opening path //
  ceph:  open_root_inode opening '//'
  ceph:  fill_trace 0000000059b8a3bc is_dentry 0 is_target 1
  ceph:  alloc_inode 00000000dc4ca00b
  ceph:  get_inode created new inode 00000000dc4ca00b 1.ffffffffffffffff ino 1
  ceph:  get_inode on 1=1.ffffffffffffffff got 00000000dc4ca00b

And then when creating any new file or directory under the mount
point, we can hit the following BUG_ON in ceph_fill_trace():

  BUG_ON(ceph_snap(dir) != dvino.snap);

Have the client ignore the extra slashes in the server path when
mounting. This will also canonicalize the path, so that identical mounts
can be consilidated.

1) "//mydir1///mydir//"
2) "/mydir1/mydir"
3) "/mydir1/mydir/"

Regardless of the internal treatment of these paths, the kernel still
stores the original string including the leading '/' for presentation
to userland.

URL: https://tracker.ceph.com/issues/42771
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27 16:53:40 +01:00
Xiubo Li
b38c9eb475 ceph: add possible_max_rank and make the code more readable
The m_num_mds here is actually the number for MDSs which are in
up:active status, and it will be duplicated to m_num_active_mds,
so remove it.

Add possible_max_rank to the mdsmap struct and this will be
the correctly possible largest rank boundary.

Remove the special case for one mds in __mdsmap_get_random_mds(),
because the validate mds rank may not always be 0.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27 16:53:40 +01:00
Xiubo Li
0eb308531f ceph: print dentry offset in hex and fix xattr_version type
In the debug logs about the di->offset or ctx->pos it is in hex
format, but some others are using the dec format. It is a little
hard to read.

For the xattr version, it is u64 type, using a shorter type may
truncate it.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27 16:53:39 +01:00
Xiubo Li
9f8b72b3a9 ceph: only touch the caps which have the subset mask requested
For the caps having no any subset mask requested we shouldn't touch
them.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27 16:53:39 +01:00
Jeff Layton
893e456b2c ceph: don't clear I_NEW until inode metadata is fully populated
Currently, we could have an open-by-handle (or NFS server) call
into the filesystem and start working with an inode before it's
properly filled out.

Don't clear I_NEW until we have filled out the inode, and discard it
properly if that fails. Note that we occasionally take an extra
reference to the inode to ensure that we don't put the last reference in
discard_new_inode, but rather leave it for ceph_async_iput.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27 16:53:39 +01:00
Xiubo Li
c4853e9776 ceph: retry the same mds later after the new session is opened
If max_mds > 1 and a request is submitted that chooses a random mds
rank, and the relating session is not opened yet, the request will wait
until the session has been opened and resend again.

Every time the request goes through __do_request, it will release the
req->session first and choose a random one again, which may be a
completely different rank than the one it just waited on.

In the worst case, it will open all the mds sessions one by one just
before the request can be successfully sent out.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27 16:53:39 +01:00
Xiubo Li
97820058fb ceph: check availability of mds cluster on mount after wait timeout
If all the MDS daemons are down for some reason, then the first mount
attempt will fail with EIO after the mount request times out.  A mount
attempt will also fail with EIO if all of the MDS's are laggy.

This patch changes the code to return -EHOSTUNREACH in these situations
and adds a pr_info error message to help the admin determine the cause.

URL: https://tracker.ceph.com/issues/4386
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27 16:53:39 +01:00
Xiubo Li
4d681c2f91 ceph: keep the session state until it is released
When reconnecting the session but if it is denied by the MDS due
to client was in blacklist or something else, kclient will receive
a session close reply, and we will never see the important log:

"ceph:  mds%d reconnect denied"

And with the confusing log:

"ceph:  handle_session mds0 close 0000000085804730 state ??? seq 0"

Let's keep the session state until its memories is released.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27 16:53:39 +01:00
Xiubo Li
9cf54563b0 ceph: add __send_request helper
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27 16:53:39 +01:00
Jeff Layton
9a6bed4fe0 ceph: ensure we have a new cap before continuing in fill_inode
If the caller passes in a NULL cap_reservation, and we can't allocate
one then ensure that we fail gracefully.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27 16:53:39 +01:00
Jeff Layton
57c2199482 ceph: drop unused ttl_from parameter from fill_inode
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27 16:53:39 +01:00