License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier should be applied
to a file was done in a spreadsheet of side-by-side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
should be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0                                               11139
and resulted in the first patch in this series.
If the file was under a */uapi/* path, it was tagged "GPL-2.0 WITH
Linux-syscall-note"; otherwise it was tagged "GPL-2.0". The results were:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                         930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier                            # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note                        270
GPL-2.0+ WITH Linux-syscall-note                       169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)     21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)     17
LGPL-2.1+ WITH Linux-syscall-note                       15
GPL-1.0+ WITH Linux-syscall-note                        14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)     5
LGPL-2.0+ WITH Linux-syscall-note                        4
LGPL-2.1 WITH Linux-syscall-note                         3
((GPL-2.0 WITH Linux-syscall-note) OR MIT)               3
((GPL-2.0 WITH Linux-syscall-note) AND MIT)              1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with the SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
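
The last step described above, emitting the tag in the comment style each
file type expects, could be sketched roughly as follows. This is not the
script Thomas and Greg used: spdx_line() and has_suffix() are hypothetical
helpers, and the only rule encoded is the kernel convention that .c files
take a "//" comment while headers take a "/* */" comment.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return 1 if `path` ends with `suffix`. */
static int has_suffix(const char *path, const char *suffix)
{
	size_t plen = strlen(path), slen = strlen(suffix);

	return plen >= slen && strcmp(path + plen - slen, suffix) == 0;
}

/*
 * Hypothetical helper: format the SPDX tag line for `path` into `buf`.
 * Returns 0 on success, -1 for a file type this sketch does not handle
 * or when `buf` is too small.
 */
static int spdx_line(const char *path, const char *license,
		     char *buf, size_t buflen)
{
	int n;

	if (has_suffix(path, ".c"))
		n = snprintf(buf, buflen,
			     "// SPDX-License-Identifier: %s", license);
	else if (has_suffix(path, ".h"))
		n = snprintf(buf, buflen,
			     "/* SPDX-License-Identifier: %s */", license);
	else
		return -1;

	return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```

Other file types (Makefiles, assembly, scripts) would need their own
comment syntax; the sketch rejects them rather than guessing.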
2017-11-01 15:07:57 +01:00
// SPDX-License-Identifier: GPL-2.0
#include <linux/ceph/ceph_debug.h>

#include <linux/fs.h>
#include <linux/wait.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
The percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include
those headers directly instead of assuming availability. As this
conversion needs to touch a large number of source files, the following
script is used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surroundings. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
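
The first rule above (gfp.h if only gfp is used, slab.h if slab is used)
can be sketched as below. This is only an illustration, not the logic of
slabh-sweep.py: needed_header() is a hypothetical helper and the token
lists are assumptions chosen for the example. slab.h wins when both are
used because, as noted above, slab.h itself includes gfp.h.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return the header a source text should include
 * directly, or NULL if it uses neither facility. The token lists are
 * illustrative assumptions, not the patterns the real script matched.
 */
static const char *needed_header(const char *src)
{
	static const char *const slab_tokens[] = {
		"kmalloc(", "kzalloc(", "kfree("
	};
	static const char *const gfp_tokens[] = {
		"GFP_KERNEL", "GFP_ATOMIC", "alloc_pages("
	};
	size_t i;

	/* slab usage takes priority: slab.h pulls in gfp.h as well */
	for (i = 0; i < sizeof(slab_tokens) / sizeof(slab_tokens[0]); i++)
		if (strstr(src, slab_tokens[i]))
			return "linux/slab.h";

	for (i = 0; i < sizeof(gfp_tokens) / sizeof(gfp_tokens[0]); i++)
		if (strstr(src, gfp_tokens[i]))
			return "linux/gfp.h";

	return NULL;
}
```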
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from the build tests
in step 7, I'm fairly confident about the coverage of this conversion
patch. If there is a breakage, it's likely to be something in one of the
arch headers, which should be easily discoverable on most builds of the
specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 17:04:11 +09:00
#include <linux/slab.h>
#include <linux/gfp.h>
#include <linux/sched.h>
#include <linux/debugfs.h>
#include <linux/seq_file.h>
#include <linux/ratelimit.h>
#include <linux/bits.h>
#include <linux/ktime.h>
#include <linux/bitmap.h>
#include <linux/mnt_idmapping.h>

#include "super.h"
#include "mds_client.h"
#include "crypto.h"

#include <linux/ceph/ceph_features.h>
#include <linux/ceph/messenger.h>
#include <linux/ceph/decode.h>
#include <linux/ceph/pagelist.h>
#include <linux/ceph/auth.h>
#include <linux/ceph/debugfs.h>

#define RECONNECT_MAX_SIZE (INT_MAX - PAGE_SIZE)
2009-10-06 11:31:09 -07:00
/*
 * A cluster of MDS (metadata server) daemons is responsible for
 * managing the file system namespace (the directory hierarchy and
 * inodes) and for coordinating shared access to storage.  Metadata is
 * partitioned hierarchically across a number of servers, and that
 * partition varies over time as the cluster adjusts the distribution
 * in order to balance load.
 *
 * The MDS client is primarily responsible for managing synchronous
 * metadata requests for operations like open, unlink, and so forth.
 * If there is an MDS failure, we find out about it when we (possibly
 * request and) receive a new MDS map, and can resubmit affected
 * requests.
 *
 * For the most part, though, we take advantage of a lossless
 * communications channel to the MDS, and do not need to worry about
 * timing out or resubmitting requests.
 *
 * We maintain a stateful "session" with each MDS we interact with.
 * Within each session, we send periodic heartbeat messages to ensure
 * any capabilities or leases we have been issued remain valid.  If
 * the session times out and goes stale, our leases and capabilities
 * are no longer valid.
 */
2010-05-12 15:21:32 -07:00
struct ceph_reconnect_state {
	struct ceph_mds_session *session;
	int nr_caps, nr_realms;
	struct ceph_pagelist *pagelist;
	unsigned msg_version;
	bool allow_multi;
};

static void __wake_requests(struct ceph_mds_client *mdsc,
			    struct list_head *head);
static void ceph_cap_release_work(struct work_struct *work);
static void ceph_cap_reclaim_work(struct work_struct *work);

static const struct ceph_connection_operations mds_con_ops;
2009-10-06 11:31:09 -07:00
/*
* mds reply parsing
*/
2019-01-09 10:10:17 +08:00
static int parse_reply_info_quota(void **p, void *end,
				  struct ceph_mds_reply_info_in *info)
{
	u8 struct_v, struct_compat;
	u32 struct_len;

	ceph_decode_8_safe(p, end, struct_v, bad);
	ceph_decode_8_safe(p, end, struct_compat, bad);
	/* struct_v is expected to be >= 1. we only
	 * understand encoding with struct_compat == 1. */
	if (!struct_v || struct_compat != 1)
		goto bad;
	ceph_decode_32_safe(p, end, struct_len, bad);
	ceph_decode_need(p, end, struct_len, bad);
	end = *p + struct_len;
	ceph_decode_64_safe(p, end, info->max_bytes, bad);
	ceph_decode_64_safe(p, end, info->max_files, bad);
	*p = end;
	return 0;
bad:
	return -EIO;
}
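
The versioned-envelope pattern used by parse_reply_info_quota() (a
struct_v byte, a struct_compat byte, a 32-bit length, then the payload,
with the cursor finally moved to the end of the envelope so that fields
added by newer encoders are skipped) can be tried outside the kernel with
a sketch like the one below. It is a userspace analogue under stated
assumptions, not the kernel code: struct quota, get_le() and
decode_quota() are hypothetical names, and the buffer is assumed to be
little-endian.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the two decoded fields. */
struct quota {
	uint64_t max_bytes;
	uint64_t max_files;
};

/* Read an n-byte little-endian integer starting at p. */
static uint64_t get_le(const uint8_t *p, size_t n)
{
	uint64_t v = 0;

	while (n--)
		v = (v << 8) | p[n];
	return v;
}

/*
 * Hypothetical userspace analogue of parse_reply_info_quota().
 * Returns 0 on success and advances *p past the whole envelope;
 * returns -1 (leaving *p untouched) on a short or incompatible buffer.
 */
static int decode_quota(const uint8_t **p, const uint8_t *end,
			struct quota *q)
{
	const uint8_t *cur = *p;
	uint8_t struct_v, struct_compat;
	uint32_t struct_len;

	if ((size_t)(end - cur) < 6)	/* struct_v + struct_compat + len */
		return -1;
	struct_v = cur[0];
	struct_compat = cur[1];
	struct_len = (uint32_t)get_le(cur + 2, 4);
	cur += 6;

	/* struct_v must be >= 1; only struct_compat == 1 is understood */
	if (!struct_v || struct_compat != 1)
		return -1;
	if ((size_t)(end - cur) < struct_len)
		return -1;
	end = cur + struct_len;		/* narrow bounds to this struct */

	if ((size_t)(end - cur) < 16)	/* two 64-bit fields */
		return -1;
	q->max_bytes = get_le(cur, 8);
	q->max_files = get_le(cur + 8, 8);

	*p = end;			/* skip any unknown trailing fields */
	return 0;
}
```

Narrowing `end` to the envelope before reading the fields is what keeps a
corrupt length from letting the decoder run into the next structure.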
2009-10-06 11:31:09 -07:00
/*
 * parse individual inode info
 */
static int parse_reply_info_in(void **p, void *end,
			       struct ceph_mds_reply_info_in *info,
			       u64 features)
{
	int err = 0;
	u8 struct_v = 0;

	if (features == (u64)-1) {
		u32 struct_len;
		u8 struct_compat;

		ceph_decode_8_safe(p, end, struct_v, bad);
		ceph_decode_8_safe(p, end, struct_compat, bad);
		/* struct_v is expected to be >= 1. we only understand
		 * encoding with struct_compat == 1. */
		if (!struct_v || struct_compat != 1)
			goto bad;
		ceph_decode_32_safe(p, end, struct_len, bad);
		ceph_decode_need(p, end, struct_len, bad);
		end = *p + struct_len;
	}

	ceph_decode_need(p, end, sizeof(struct ceph_mds_reply_inode), bad);
	info->in = *p;
	*p += sizeof(struct ceph_mds_reply_inode) +
		sizeof(*info->in->fragtree.splits) *
		le32_to_cpu(info->in->fragtree.nsplits);

	ceph_decode_32_safe(p, end, info->symlink_len, bad);
	ceph_decode_need(p, end, info->symlink_len, bad);
	info->symlink = *p;
	*p += info->symlink_len;

	ceph_decode_copy_safe(p, end, &info->dir_layout,
			      sizeof(info->dir_layout), bad);
	ceph_decode_32_safe(p, end, info->xattr_len, bad);
	ceph_decode_need(p, end, info->xattr_len, bad);
	info->xattr_data = *p;
	*p += info->xattr_len;

	if (features == (u64)-1) {
		/* inline data */
		ceph_decode_64_safe(p, end, info->inline_version, bad);
		ceph_decode_32_safe(p, end, info->inline_len, bad);
		ceph_decode_need(p, end, info->inline_len, bad);
		info->inline_data = *p;
		*p += info->inline_len;
		/* quota */
		err = parse_reply_info_quota(p, end, info);
		if (err < 0)
			goto out_bad;
		/* pool namespace */
		ceph_decode_32_safe(p, end, info->pool_ns_len, bad);
		if (info->pool_ns_len > 0) {
			ceph_decode_need(p, end, info->pool_ns_len, bad);
			info->pool_ns_data = *p;
			*p += info->pool_ns_len;
		}

		/* btime */
		ceph_decode_need(p, end, sizeof(info->btime), bad);
		ceph_decode_copy(p, &info->btime, sizeof(info->btime));

		/* change attribute */
		ceph_decode_64_safe(p, end, info->change_attr, bad);

		/* dir pin */
		if (struct_v >= 2) {
			ceph_decode_32_safe(p, end, info->dir_pin, bad);
		} else {
			info->dir_pin = -ENODATA;
		}

		/* snapshot birth time, remains zero for v<=2 */
		if (struct_v >= 3) {
			ceph_decode_need(p, end, sizeof(info->snap_btime), bad);
			ceph_decode_copy(p, &info->snap_btime,
					 sizeof(info->snap_btime));
		} else {
			memset(&info->snap_btime, 0, sizeof(info->snap_btime));
		}

		/* snapshot count, remains zero for v<=3 */
		if (struct_v >= 4) {
			ceph_decode_64_safe(p, end, info->rsnaps, bad);
		} else {
			info->rsnaps = 0;
		}

		if (struct_v >= 5) {
			u32 alen;

			ceph_decode_32_safe(p, end, alen, bad);

			while (alen--) {
				u32 len;

				/* key */
				ceph_decode_32_safe(p, end, len, bad);
				ceph_decode_skip_n(p, end, len, bad);
				/* value */
				ceph_decode_32_safe(p, end, len, bad);
				ceph_decode_skip_n(p, end, len, bad);
			}
		}

		/* fscrypt flag -- ignore */
		if (struct_v >= 6)
			ceph_decode_skip_8(p, end, bad);

		info->fscrypt_auth = NULL;
		info->fscrypt_auth_len = 0;
		info->fscrypt_file = NULL;
		info->fscrypt_file_len = 0;
		if (struct_v >= 7) {
			ceph_decode_32_safe(p, end, info->fscrypt_auth_len, bad);
			if (info->fscrypt_auth_len) {
				info->fscrypt_auth = kmalloc(info->fscrypt_auth_len,
							     GFP_KERNEL);
				if (!info->fscrypt_auth)
					return -ENOMEM;
				ceph_decode_copy_safe(p, end, info->fscrypt_auth,
						      info->fscrypt_auth_len, bad);
			}
			ceph_decode_32_safe(p, end, info->fscrypt_file_len, bad);
			if (info->fscrypt_file_len) {
				info->fscrypt_file = kmalloc(info->fscrypt_file_len,
							     GFP_KERNEL);
				if (!info->fscrypt_file)
					return -ENOMEM;
				ceph_decode_copy_safe(p, end, info->fscrypt_file,
						      info->fscrypt_file_len, bad);
			}
		}
		*p = end;
	} else {
		/* legacy (unversioned) struct */
		if (features & CEPH_FEATURE_MDS_INLINE_DATA) {
			ceph_decode_64_safe(p, end, info->inline_version, bad);
			ceph_decode_32_safe(p, end, info->inline_len, bad);
			ceph_decode_need(p, end, info->inline_len, bad);
			info->inline_data = *p;
			*p += info->inline_len;
		} else
			info->inline_version = CEPH_INLINE_NONE;

		if (features & CEPH_FEATURE_MDS_QUOTA) {
			err = parse_reply_info_quota(p, end, info);
			if (err < 0)
				goto out_bad;
		} else {
			info->max_bytes = 0;
			info->max_files = 0;
		}

		info->pool_ns_len = 0;
		info->pool_ns_data = NULL;
		if (features & CEPH_FEATURE_FS_FILE_LAYOUT_V2) {
			ceph_decode_32_safe(p, end, info->pool_ns_len, bad);
			if (info->pool_ns_len > 0) {
				ceph_decode_need(p, end, info->pool_ns_len, bad);
				info->pool_ns_data = *p;
				*p += info->pool_ns_len;
			}
		}

		if (features & CEPH_FEATURE_FS_BTIME) {
			ceph_decode_need(p, end, sizeof(info->btime), bad);
			ceph_decode_copy(p, &info->btime, sizeof(info->btime));
			ceph_decode_64_safe(p, end, info->change_attr, bad);
		}

		info->dir_pin = -ENODATA;
		/* info->snap_btime and info->rsnaps remain zero */
	}
	return 0;
bad:
	err = -EIO;
out_bad:
	return err;
}

static int parse_reply_info_dir(void **p, void *end,
				struct ceph_mds_reply_dirfrag **dirfrag,
				u64 features)
{
	if (features == (u64)-1) {
		u8 struct_v, struct_compat;
		u32 struct_len;

		ceph_decode_8_safe(p, end, struct_v, bad);
		ceph_decode_8_safe(p, end, struct_compat, bad);
		/* struct_v is expected to be >= 1. we only understand
		 * encoding whose struct_compat == 1. */
		if (!struct_v || struct_compat != 1)
			goto bad;
		ceph_decode_32_safe(p, end, struct_len, bad);
		ceph_decode_need(p, end, struct_len, bad);
		end = *p + struct_len;
	}

	ceph_decode_need(p, end, sizeof(**dirfrag), bad);
	*dirfrag = *p;
	*p += sizeof(**dirfrag) + sizeof(u32) * le32_to_cpu((*dirfrag)->ndist);
	if (unlikely(*p > end))
		goto bad;
	if (features == (u64)-1)
		*p = end;
	return 0;
bad:
	return -EIO;
}

static int parse_reply_info_lease(void **p, void *end,
				  struct ceph_mds_reply_lease **lease,
				  u64 features, u32 *altname_len, u8 **altname)
{
	u8 struct_v;
	u32 struct_len;
	void *lend;

	if (features == (u64)-1) {
		u8 struct_compat;

		ceph_decode_8_safe(p, end, struct_v, bad);
		ceph_decode_8_safe(p, end, struct_compat, bad);

		/* struct_v is expected to be >= 1. we only understand
		 * encoding whose struct_compat == 1. */
		if (!struct_v || struct_compat != 1)
			goto bad;

		ceph_decode_32_safe(p, end, struct_len, bad);
	} else {
		struct_len = sizeof(**lease);
		*altname_len = 0;
		*altname = NULL;
	}

	lend = *p + struct_len;
	ceph_decode_need(p, end, struct_len, bad);
	*lease = *p;
	*p += sizeof(**lease);

	if (features == (u64)-1) {
		if (struct_v >= 2) {
			ceph_decode_32_safe(p, end, *altname_len, bad);
			ceph_decode_need(p, end, *altname_len, bad);
			*altname = *p;
			*p += *altname_len;
		} else {
			*altname = NULL;
			*altname_len = 0;
		}
	}
	*p = lend;
	return 0;
bad:
	return -EIO;
}

/*
 * parse a normal reply, which may contain a (dir+)dentry and/or a
 * target inode.
 */
static int parse_reply_info_trace(void **p, void *end,
				  struct ceph_mds_reply_info_parsed *info,
				  u64 features)
{
	int err;

	if (info->head->is_dentry) {
		err = parse_reply_info_in(p, end, &info->diri, features);
		if (err < 0)
			goto out_bad;

		err = parse_reply_info_dir(p, end, &info->dirfrag, features);
		if (err < 0)
			goto out_bad;

		ceph_decode_32_safe(p, end, info->dname_len, bad);
		ceph_decode_need(p, end, info->dname_len, bad);
		info->dname = *p;
		*p += info->dname_len;

		err = parse_reply_info_lease(p, end, &info->dlease, features,
					     &info->altname_len, &info->altname);
		if (err < 0)
			goto out_bad;
	}

	if (info->head->is_target) {
		err = parse_reply_info_in(p, end, &info->targeti, features);
		if (err < 0)
			goto out_bad;
	}

	if (unlikely(*p != end))
		goto bad;
	return 0;

bad:
	err = -EIO;
out_bad:
	pr_err("problem parsing mds trace %d\n", err);
	return err;
}

/*
 * parse readdir results
 */
static int parse_reply_info_readdir(void **p, void *end,
				    struct ceph_mds_request *req,
				    u64 features)
{
	struct ceph_mds_reply_info_parsed *info = &req->r_reply_info;
	struct ceph_client *cl = req->r_mdsc->fsc->client;
	u32 num, i = 0;
	int err;

	err = parse_reply_info_dir(p, end, &info->dir_dir, features);
	if (err < 0)
		goto out_bad;

	ceph_decode_need(p, end, sizeof(num) + 2, bad);
	num = ceph_decode_32(p);
	{
		u16 flags = ceph_decode_16(p);
		info->dir_end = !!(flags & CEPH_READDIR_FRAG_END);
		info->dir_complete = !!(flags & CEPH_READDIR_FRAG_COMPLETE);
		info->hash_order = !!(flags & CEPH_READDIR_HASH_ORDER);
		info->offset_hash = !!(flags & CEPH_READDIR_OFFSET_HASH);
	}
	if (num == 0)
		goto done;

	BUG_ON(!info->dir_entries);
	if ((unsigned long)(info->dir_entries + num) >
	    (unsigned long)info->dir_entries + info->dir_buf_size) {
		pr_err_client(cl, "dir contents are larger than expected\n");
		WARN_ON(1);
		goto bad;
	}

	info->dir_nr = num;
	while (num) {
		struct inode *inode = d_inode(req->r_dentry);
		struct ceph_inode_info *ci = ceph_inode(inode);
		struct ceph_mds_reply_dir_entry *rde = info->dir_entries + i;
		struct fscrypt_str tname = FSTR_INIT(NULL, 0);
		struct fscrypt_str oname = FSTR_INIT(NULL, 0);
		struct ceph_fname fname;
		u32 altname_len, _name_len;
		u8 *altname, *_name;

		/* dentry */
		ceph_decode_32_safe(p, end, _name_len, bad);
		ceph_decode_need(p, end, _name_len, bad);
		_name = *p;
		*p += _name_len;
		doutc(cl, "parsed dir dname '%.*s'\n", _name_len, _name);

		if (info->hash_order)
			rde->raw_hash = ceph_str_hash(ci->i_dir_layout.dl_dir_hash,
						      _name, _name_len);

		/* dentry lease */
		err = parse_reply_info_lease(p, end, &rde->lease, features,
					     &altname_len, &altname);
		if (err)
			goto out_bad;

		/*
		 * Try to decrypt the dentry names and update them
		 * in the ceph_mds_reply_dir_entry struct.
		 */
		fname.dir = inode;
		fname.name = _name;
		fname.name_len = _name_len;
		fname.ctext = altname;
		fname.ctext_len = altname_len;
		/*
		 * The _name_len may be larger than altname_len, such as
		 * when the human readable name length is in the range of
		 * (CEPH_NOHASH_NAME_MAX, CEPH_NOHASH_NAME_MAX + SHA256_DIGEST_SIZE),
		 * in which case the copy in ceph_fname_to_usr would corrupt
		 * the data if there is no encryption key.
		 *
		 * Just set the no_copy flag; then, if there is no
		 * encryption key, oname.name will always be assigned to
		 * _name.
		 */
		fname.no_copy = true;
		if (altname_len == 0) {
			/*
			 * Set tname to _name, and this will be used
			 * to do the base64_decode in-place. It's
			 * safe because the decoded string should
			 * always be shorter, which is 3/4 of the
			 * original string.
			 */
			tname.name = _name;

			/*
			 * Set oname to _name too, and this will be
			 * used to do the decryption in-place.
			 */
			oname.name = _name;
			oname.len = _name_len;
		} else {
			/*
			 * This will do the decryption only in-place
			 * from altname cryptext directly.
			 */
			oname.name = altname;
			oname.len = altname_len;
		}
		rde->is_nokey = false;
		err = ceph_fname_to_usr(&fname, &tname, &oname, &rde->is_nokey);
		if (err) {
			pr_err_client(cl, "unable to decode %.*s, got %d\n",
				      _name_len, _name, err);
			goto out_bad;
		}
		rde->name = oname.name;
		rde->name_len = oname.len;

		/* inode */
		err = parse_reply_info_in(p, end, &rde->inode, features);
		if (err < 0)
			goto out_bad;
		/* ceph_readdir_prepopulate() will update it */
		rde->offset = 0;
		i++;
		num--;
	}

done:
	/* Skip over any unrecognized fields */
	*p = end;
	return 0;

bad:
	err = -EIO;
out_bad:
	pr_err_client(cl, "problem parsing dir contents %d\n", err);
	return err;
}
2010-12-01 14:14:38 -08:00
/*
 * parse fcntl F_GETLK results
 */
static int parse_reply_info_filelock(void **p, void *end,
				     struct ceph_mds_reply_info_parsed *info,
				     u64 features)
{
	if (*p + sizeof(*info->filelock_reply) > end)
		goto bad;

	info->filelock_reply = *p;

	/* Skip over any unrecognized fields */
	*p = end;
	return 0;
bad:
	return -EIO;
}

#if BITS_PER_LONG == 64

#define DELEGATED_INO_AVAILABLE		xa_mk_value(1)

static int ceph_parse_deleg_inos(void **p, void *end,
				 struct ceph_mds_session *s)
{
	struct ceph_client *cl = s->s_mdsc->fsc->client;
	u32 sets;

	ceph_decode_32_safe(p, end, sets, bad);
	doutc(cl, "got %u sets of delegated inodes\n", sets);
	while (sets--) {
		u64 start, len;

		ceph_decode_64_safe(p, end, start, bad);
		ceph_decode_64_safe(p, end, len, bad);

		/* Don't accept a delegation of system inodes */
		if (start < CEPH_INO_SYSTEM_BASE) {
			pr_warn_ratelimited_client(cl,
				"ignoring reserved inode range delegation (start=0x%llx len=0x%llx)\n",
				start, len);
			continue;
		}
		while (len--) {
			int err = xa_insert(&s->s_delegated_inos, start++,
					    DELEGATED_INO_AVAILABLE,
					    GFP_KERNEL);
			if (!err) {
				doutc(cl, "added delegated inode 0x%llx\n", start - 1);
			} else if (err == -EBUSY) {
				pr_warn_client(cl,
					"MDS delegated inode 0x%llx more than once.\n",
					start - 1);
			} else {
				return err;
			}
		}
	}
	return 0;
bad:
	return -EIO;
}

u64 ceph_get_deleg_ino(struct ceph_mds_session *s)
{
	unsigned long ino;
	void *val;

	xa_for_each(&s->s_delegated_inos, ino, val) {
		val = xa_erase(&s->s_delegated_inos, ino);
		if (val == DELEGATED_INO_AVAILABLE)
			return ino;
	}
	return 0;
}

int ceph_restore_deleg_ino(struct ceph_mds_session *s, u64 ino)
{
	return xa_insert(&s->s_delegated_inos, ino, DELEGATED_INO_AVAILABLE,
			 GFP_KERNEL);
}
#else /* BITS_PER_LONG == 64 */
/*
 * FIXME: xarrays can't handle 64-bit indexes on a 32-bit arch. For now, just
 * ignore delegated_inos on 32 bit arch. Maybe eventually add xarrays for top
 * and bottom words?
 */
static int ceph_parse_deleg_inos(void **p, void *end,
				 struct ceph_mds_session *s)
{
	u32 sets;

	ceph_decode_32_safe(p, end, sets, bad);
	if (sets)
		ceph_decode_skip_n(p, end, sets * 2 * sizeof(__le64), bad);
	return 0;
bad:
	return -EIO;
}

u64 ceph_get_deleg_ino(struct ceph_mds_session *s)
{
	return 0;
}

int ceph_restore_deleg_ino(struct ceph_mds_session *s, u64 ino)
{
	return 0;
}
# endif /* BITS_PER_LONG == 64 */
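
The delegated-inode parser above reads a 32-bit count followed by that many (start, len) pairs, bailing out if the buffer runs short and skipping ranges that begin below the reserved system base. The following is a minimal userspace sketch of that decode loop, not the kernel code: `get_u32_le`, `get_u64_le`, `parse_ranges` and `RESERVED_BASE` are hypothetical names, and the sketch only counts accepted ranges instead of inserting each inode into an xarray.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for CEPH_INO_SYSTEM_BASE; the real value is not shown here. */
#define RESERVED_BASE 0x1000ULL

/* Read a little-endian u32 at *p; return -1 rather than read past end. */
static int get_u32_le(const uint8_t **p, const uint8_t *end, uint32_t *out)
{
	if ((size_t)(end - *p) < 4)
		return -1;
	*out = (uint32_t)(*p)[0] | (uint32_t)(*p)[1] << 8 |
	       (uint32_t)(*p)[2] << 16 | (uint32_t)(*p)[3] << 24;
	*p += 4;
	return 0;
}

/* Read a little-endian u64 at *p; return -1 rather than read past end. */
static int get_u64_le(const uint8_t **p, const uint8_t *end, uint64_t *out)
{
	uint64_t v = 0;
	int i;

	if ((size_t)(end - *p) < 8)
		return -1;
	for (i = 7; i >= 0; i--)
		v = (v << 8) | (*p)[i];
	*p += 8;
	*out = v;
	return 0;
}

/*
 * Decode "sets" (start, len) pairs. Returns the number of accepted ranges,
 * or -1 on truncated input; *total receives the sum of accepted lengths.
 */
static int parse_ranges(const uint8_t *buf, size_t len, uint64_t *total)
{
	const uint8_t *p = buf, *end = buf + len;
	uint32_t sets;
	int accepted = 0;

	assert(buf && total);
	*total = 0;
	if (get_u32_le(&p, end, &sets))
		return -1;
	while (sets--) {
		uint64_t start, n;

		if (get_u64_le(&p, end, &start) || get_u64_le(&p, end, &n))
			return -1;
		if (start < RESERVED_BASE)
			continue;	/* ignore reserved ranges, keep parsing */
		*total += n;
		accepted++;
	}
	return accepted;
}
```

Note that a reserved range is skipped only after both of its fields have been decoded, so the cursor stays aligned on the next pair.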

/*
 * parse create results
 */
static int parse_reply_info_create(void **p, void *end,
				  struct ceph_mds_reply_info_parsed *info,
				  u64 features, struct ceph_mds_session *s)
{
	int ret;

	if (features == (u64)-1 ||
	    (features & CEPH_FEATURE_REPLY_CREATE_INODE)) {
		if (*p == end) {
			/* Malformed reply? */
			info->has_create_ino = false;
		} else if (test_bit(CEPHFS_FEATURE_DELEG_INO, &s->s_features)) {
			info->has_create_ino = true;
			/* struct_v, struct_compat, and len */
			ceph_decode_skip_n(p, end, 2 + sizeof(u32), bad);
			ceph_decode_64_safe(p, end, info->ino, bad);
			ret = ceph_parse_deleg_inos(p, end, s);
			if (ret)
				return ret;
		} else {
			/* legacy */
			ceph_decode_64_safe(p, end, info->ino, bad);
			info->has_create_ino = true;
		}
	} else {
		if (*p != end)
			goto bad;
	}

	/* Skip over any unrecognized fields */
	*p = end;
	return 0;
bad:
	return -EIO;
}

static int parse_reply_info_getvxattr(void **p, void *end,
				      struct ceph_mds_reply_info_parsed *info,
				      u64 features)
{
	u32 value_len;

	ceph_decode_skip_8(p, end, bad); /* skip current version: 1 */
	ceph_decode_skip_8(p, end, bad); /* skip first version: 1 */
	ceph_decode_skip_32(p, end, bad); /* skip payload length */

	ceph_decode_32_safe(p, end, value_len, bad);

	if (value_len == end - *p) {
		info->xattr_info.xattr_value = *p;
		info->xattr_info.xattr_value_len = value_len;
		*p = end;
		return value_len;
	}
bad:
	return -EIO;
}

/*
 * parse extra results
 */
static int parse_reply_info_extra(void **p, void *end,
				  struct ceph_mds_request *req,
				  u64 features, struct ceph_mds_session *s)
{
	struct ceph_mds_reply_info_parsed *info = &req->r_reply_info;
	u32 op = le32_to_cpu(info->head->op);

	if (op == CEPH_MDS_OP_GETFILELOCK)
		return parse_reply_info_filelock(p, end, info, features);
	else if (op == CEPH_MDS_OP_READDIR || op == CEPH_MDS_OP_LSSNAP)
		return parse_reply_info_readdir(p, end, req, features);
	else if (op == CEPH_MDS_OP_CREATE)
		return parse_reply_info_create(p, end, info, features, s);
	else if (op == CEPH_MDS_OP_GETVXATTR)
		return parse_reply_info_getvxattr(p, end, info, features);
	else
		return -EIO;
}

/*
 * parse entire mds reply
 */
static int parse_reply_info(struct ceph_mds_session *s, struct ceph_msg *msg,
			    struct ceph_mds_request *req, u64 features)
{
	struct ceph_mds_reply_info_parsed *info = &req->r_reply_info;
	struct ceph_client *cl = s->s_mdsc->fsc->client;
	void *p, *end;
	u32 len;
	int err;

	info->head = msg->front.iov_base;
	p = msg->front.iov_base + sizeof(struct ceph_mds_reply_head);
	end = p + msg->front.iov_len - sizeof(struct ceph_mds_reply_head);

	/* trace */
	ceph_decode_32_safe(&p, end, len, bad);
	if (len > 0) {
		ceph_decode_need(&p, end, len, bad);
		err = parse_reply_info_trace(&p, p + len, info, features);
		if (err < 0)
			goto out_bad;
	}

	/* extra */
	ceph_decode_32_safe(&p, end, len, bad);
	if (len > 0) {
		ceph_decode_need(&p, end, len, bad);
		err = parse_reply_info_extra(&p, p + len, req, features, s);
		if (err < 0)
			goto out_bad;
	}

	/* snap blob */
	ceph_decode_32_safe(&p, end, len, bad);
	info->snapblob_len = len;
	info->snapblob = p;
	p += len;

	if (p != end)
		goto bad;
	return 0;

bad:
	err = -EIO;
out_bad:
	pr_err_client(cl, "mds parse_reply err %d\n", err);
	ceph_msg_dump(msg);
	return err;
}
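
parse_reply_info() walks three length-prefixed sections (trace, extra, snap blob) and rejects the message unless the cursor lands exactly on the end of the buffer. A rough userspace sketch of that framing follows, under the assumption that each length is a little-endian u32; `struct sections`, `get_len` and `split_sections` are hypothetical names. Unlike the kernel code, it only records where each section sits rather than handing it to a sub-parser.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: pointer and length of each section. */
struct sections {
	const uint8_t *trace, *extra, *snapblob;
	uint32_t trace_len, extra_len, snapblob_len;
};

/* Read a little-endian u32 length prefix; -1 if fewer than 4 bytes remain. */
static int get_len(const uint8_t **p, const uint8_t *end, uint32_t *out)
{
	if ((size_t)(end - *p) < 4)
		return -1;
	*out = (uint32_t)(*p)[0] | (uint32_t)(*p)[1] << 8 |
	       (uint32_t)(*p)[2] << 16 | (uint32_t)(*p)[3] << 24;
	*p += 4;
	return 0;
}

/* Split buf into trace / extra / snap blob; 0 on success, -1 if malformed. */
static int split_sections(const uint8_t *buf, size_t len, struct sections *s)
{
	const uint8_t *p = buf, *end = buf + len;

	assert(buf && s);

	/* trace: must fit in what is left */
	if (get_len(&p, end, &s->trace_len) ||
	    (size_t)(end - p) < s->trace_len)
		return -1;
	s->trace = p;
	p += s->trace_len;

	/* extra: same check */
	if (get_len(&p, end, &s->extra_len) ||
	    (size_t)(end - p) < s->extra_len)
		return -1;
	s->extra = p;
	p += s->extra_len;

	/* snap blob: must consume the remainder exactly */
	if (get_len(&p, end, &s->snapblob_len) ||
	    (size_t)(end - p) != s->snapblob_len)
		return -1;
	s->snapblob = p;
	return 0;
}
```

Checking the remaining byte count before advancing avoids ever forming a pointer past the buffer, which is a design choice of this sketch rather than a description of the kernel macros.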

static void destroy_reply_info(struct ceph_mds_reply_info_parsed *info)
{
	int i;

	kfree(info->diri.fscrypt_auth);
	kfree(info->diri.fscrypt_file);
	kfree(info->targeti.fscrypt_auth);
	kfree(info->targeti.fscrypt_file);
	if (!info->dir_entries)
		return;

	for (i = 0; i < info->dir_nr; i++) {
		struct ceph_mds_reply_dir_entry *rde = info->dir_entries + i;

		kfree(rde->inode.fscrypt_auth);
		kfree(rde->inode.fscrypt_file);
	}
	free_pages((unsigned long)info->dir_entries, get_order(info->dir_buf_size));
}

/*
 * In the async unlink case the kclient won't wait for the first reply
 * from the MDS; it just drops all the links, unhashes the dentry and
 * then succeeds immediately.
 *
 * For any new create/link/rename, etc. requests that follow and use the
 * same file names, we must wait for the first reply of the inflight
 * unlink request, or the MDS may fail these following requests with
 * -EEXIST if the inflight async unlink request was delayed for some
 * reason.
 *
 * The worst case is that a non-async openc request will successfully
 * open the file if the CDentry hasn't been unlinked yet, but later the
 * previously delayed async unlink request will remove the CDentry. That
 * means the just-created file may be deleted later by accident.
 *
 * We need to wait for the inflight async unlink requests to finish
 * when creating new files/directories by using the same file names.
 */
int ceph_wait_on_conflict_unlink(struct dentry *dentry)
{
	struct ceph_fs_client *fsc = ceph_sb_to_fs_client(dentry->d_sb);
	struct ceph_client *cl = fsc->client;
	struct dentry *pdentry = dentry->d_parent;
	struct dentry *udentry, *found = NULL;
	struct ceph_dentry_info *di;
	struct qstr dname;
	u32 hash = dentry->d_name.hash;
	int err;

	dname.name = dentry->d_name.name;
	dname.len = dentry->d_name.len;

	rcu_read_lock();
	hash_for_each_possible_rcu(fsc->async_unlink_conflict, di,
				   hnode, hash) {
		udentry = di->dentry;

		spin_lock(&udentry->d_lock);
		if (udentry->d_name.hash != hash)
			goto next;
		if (unlikely(udentry->d_parent != pdentry))
			goto next;
		if (!hash_hashed(&di->hnode))
			goto next;

		if (!test_bit(CEPH_DENTRY_ASYNC_UNLINK_BIT, &di->flags))
			pr_warn_client(cl, "dentry %p:%pd async unlink bit is not set\n",
				       dentry, dentry);

		if (!d_same_name(udentry, pdentry, &dname))
			goto next;

		found = dget_dlock(udentry);
		spin_unlock(&udentry->d_lock);
		break;
next:
		spin_unlock(&udentry->d_lock);
	}
	rcu_read_unlock();

	if (likely(!found))
		return 0;

	doutc(cl, "dentry %p:%pd conflict with old %p:%pd\n", dentry, dentry,
	      found, found);

	err = wait_on_bit(&di->flags, CEPH_DENTRY_ASYNC_UNLINK_BIT,
			  TASK_KILLABLE);
	dput(found);
	return err;
}

/*
 * sessions
 */
const char *ceph_session_state_name(int s)
{
	switch (s) {
	case CEPH_MDS_SESSION_NEW: return "new";
	case CEPH_MDS_SESSION_OPENING: return "opening";
	case CEPH_MDS_SESSION_OPEN: return "open";
	case CEPH_MDS_SESSION_HUNG: return "hung";
	case CEPH_MDS_SESSION_CLOSING: return "closing";
	case CEPH_MDS_SESSION_CLOSED: return "closed";
	case CEPH_MDS_SESSION_RESTARTING: return "restarting";
	case CEPH_MDS_SESSION_RECONNECTING: return "reconnecting";
	case CEPH_MDS_SESSION_REJECTED: return "rejected";
	default: return "???";
	}
}

struct ceph_mds_session *ceph_get_mds_session(struct ceph_mds_session *s)
{
	if (refcount_inc_not_zero(&s->s_ref))
		return s;
	return NULL;
}

void ceph_put_mds_session(struct ceph_mds_session *s)
{
	if (IS_ERR_OR_NULL(s))
		return;

	if (refcount_dec_and_test(&s->s_ref)) {
		if (s->s_auth.authorizer)
			ceph_auth_destroy_authorizer(s->s_auth.authorizer);
		WARN_ON(mutex_is_locked(&s->s_mutex));
		xa_destroy(&s->s_delegated_inos);
		kfree(s);
	}
}
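
The get/put pair above relies on "increment unless zero": once the count has reached zero the object is being torn down, so a late lookup must fail instead of resurrecting it. Below is a small userspace sketch of that idea using C11 atomics. `struct session`, `ref_inc_not_zero`, `session_get` and `session_put` are hypothetical names, and the sketch leaves out the saturation and overflow protection that the kernel's refcount_t provides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical refcounted object. */
struct session {
	atomic_uint ref;
	int id;
};

/* Bump *ref unless it is already zero; true if a reference was taken. */
static bool ref_inc_not_zero(atomic_uint *ref)
{
	unsigned int old = atomic_load(ref);

	while (old != 0) {
		/* on failure, old is reloaded with the current value */
		if (atomic_compare_exchange_weak(ref, &old, old + 1))
			return true;
	}
	return false;
}

/* Return s with an extra reference, or NULL if it is already dying. */
static struct session *session_get(struct session *s)
{
	assert(s);
	return ref_inc_not_zero(&s->ref) ? s : NULL;
}

/* Drop one reference; true if this call freed the object. */
static bool session_put(struct session *s)
{
	if (!s)
		return false;
	if (atomic_fetch_sub(&s->ref, 1) == 1) {
		free(s);
		return true;
	}
	return false;
}
```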

/*
 * called under mdsc->mutex
 */
struct ceph_mds_session *__ceph_lookup_mds_session(struct ceph_mds_client *mdsc,
						   int mds)
{
	if (mds >= mdsc->max_sessions || !mdsc->sessions[mds])
		return NULL;
	return ceph_get_mds_session(mdsc->sessions[mds]);
}

static bool __have_session(struct ceph_mds_client *mdsc, int mds)
{
	if (mds >= mdsc->max_sessions || !mdsc->sessions[mds])
		return false;
	else
		return true;
}

static int __verify_registered_session(struct ceph_mds_client *mdsc,
				       struct ceph_mds_session *s)
{
	if (s->s_mds >= mdsc->max_sessions ||
	    mdsc->sessions[s->s_mds] != s)
		return -ENOENT;
	return 0;
}

/*
 * create+register a new session for given mds.
 * called under mdsc->mutex.
 */
static struct ceph_mds_session *register_session(struct ceph_mds_client *mdsc,
						 int mds)
{
	struct ceph_client *cl = mdsc->fsc->client;
	struct ceph_mds_session *s;

	if (READ_ONCE(mdsc->fsc->mount_state) == CEPH_MOUNT_FENCE_IO)
		return ERR_PTR(-EIO);

	if (mds >= mdsc->mdsmap->possible_max_rank)
		return ERR_PTR(-EINVAL);

	s = kzalloc(sizeof(*s), GFP_NOFS);
	if (!s)
		return ERR_PTR(-ENOMEM);

	if (mds >= mdsc->max_sessions) {
		int newmax = 1 << get_count_order(mds + 1);
		struct ceph_mds_session **sa;

		doutc(cl, "realloc to %d\n", newmax);
		sa = kcalloc(newmax, sizeof(void *), GFP_NOFS);
		if (!sa)
			goto fail_realloc;
		if (mdsc->sessions) {
			memcpy(sa, mdsc->sessions,
			       mdsc->max_sessions * sizeof(void *));
			kfree(mdsc->sessions);
		}
		mdsc->sessions = sa;
		mdsc->max_sessions = newmax;
	}

	doutc(cl, "mds%d\n", mds);
	s->s_mdsc = mdsc;
	s->s_mds = mds;
	s->s_state = CEPH_MDS_SESSION_NEW;
	mutex_init(&s->s_mutex);

	ceph_con_init(&s->s_con, s, &mds_con_ops, &mdsc->fsc->client->msgr);

	atomic_set(&s->s_cap_gen, 1);
	s->s_cap_ttl = jiffies - 1;

	spin_lock_init(&s->s_cap_lock);
	INIT_LIST_HEAD(&s->s_caps);
	refcount_set(&s->s_ref, 1);
	INIT_LIST_HEAD(&s->s_waiting);
	INIT_LIST_HEAD(&s->s_unsafe);
	xa_init(&s->s_delegated_inos);
	INIT_LIST_HEAD(&s->s_cap_releases);
	INIT_WORK(&s->s_cap_release_work, ceph_cap_release_work);

	INIT_LIST_HEAD(&s->s_cap_dirty);
	INIT_LIST_HEAD(&s->s_cap_flushing);

	mdsc->sessions[mds] = s;
	atomic_inc(&mdsc->num_sessions);
	refcount_inc(&s->s_ref);  /* one ref to sessions[], one to caller */

	ceph_con_open(&s->s_con, CEPH_ENTITY_TYPE_MDS, mds,
		      ceph_mdsmap_get_addr(mdsc->mdsmap, mds));

	return s;

fail_realloc:
	kfree(s);
	return ERR_PTR(-ENOMEM);
}
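
register_session() grows the sessions array to `1 << get_count_order(mds + 1)` slots, i.e. the smallest power of two that can hold index `mds`, and copies the old pointers across. A userspace sketch of that growth policy is shown below; `struct table`, `roundup_pow2` and `table_ensure` are hypothetical names, `calloc`/`free` stand in for `kcalloc`/`kfree`, and overflow for very large indexes is not handled.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sparse pointer table indexed by a small integer rank. */
struct table {
	void **slots;
	int max;
};

/* Smallest power of two >= n, for n >= 1. */
static int roundup_pow2(int n)
{
	int p = 1;

	assert(n >= 1);
	while (p < n)
		p <<= 1;
	return p;
}

/* Make sure slot idx exists; 0 on success, -1 on allocation failure. */
static int table_ensure(struct table *t, int idx)
{
	void **na;
	int newmax;

	assert(t && idx >= 0);
	if (idx < t->max)
		return 0;

	newmax = roundup_pow2(idx + 1);
	na = calloc((size_t)newmax, sizeof(*na));	/* new slots start NULL */
	if (!na)
		return -1;
	if (t->slots) {
		memcpy(na, t->slots, (size_t)t->max * sizeof(*na));
		free(t->slots);
	}
	t->slots = na;
	t->max = newmax;
	return 0;
}
```

Rounding up to a power of two keeps the number of reallocations logarithmic in the highest rank seen, at the cost of some unused slots.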

/*
 * called under mdsc->mutex
 */
static void __unregister_session(struct ceph_mds_client *mdsc,
			       struct ceph_mds_session *s)
{
	doutc(mdsc->fsc->client, "mds%d %p\n", s->s_mds, s);
	BUG_ON(mdsc->sessions[s->s_mds] != s);
	mdsc->sessions[s->s_mds] = NULL;
	ceph_con_close(&s->s_con);
	ceph_put_mds_session(s);
	atomic_dec(&mdsc->num_sessions);
}

/*
 * drop session refs in request.
 *
 * should be last request ref, or hold mdsc->mutex
 */
static void put_request_session(struct ceph_mds_request *req)
{
	if (req->r_session) {
		ceph_put_mds_session(req->r_session);
		req->r_session = NULL;
	}
}

void ceph_mdsc_iterate_sessions(struct ceph_mds_client *mdsc,
				void (*cb)(struct ceph_mds_session *),
				bool check_state)
{
	int mds;

	mutex_lock(&mdsc->mutex);
	for (mds = 0; mds < mdsc->max_sessions; ++mds) {
		struct ceph_mds_session *s;

		s = __ceph_lookup_mds_session(mdsc, mds);
		if (!s)
			continue;

		if (check_state && !check_session_state(s)) {
			ceph_put_mds_session(s);
			continue;
		}

		mutex_unlock(&mdsc->mutex);
		cb(s);
		ceph_put_mds_session(s);
		mutex_lock(&mdsc->mutex);
	}
	mutex_unlock(&mdsc->mutex);
}

void ceph_mdsc_release_request(struct kref *kref)
{
	struct ceph_mds_request *req = container_of(kref,
						    struct ceph_mds_request,
						    r_kref);
	ceph_mdsc_release_dir_caps_no_check(req);
	destroy_reply_info(&req->r_reply_info);
	if (req->r_request)
		ceph_msg_put(req->r_request);
	if (req->r_reply)
		ceph_msg_put(req->r_reply);
	if (req->r_inode) {
		ceph_put_cap_refs(ceph_inode(req->r_inode), CEPH_CAP_PIN);
		iput(req->r_inode);
	}
	if (req->r_parent) {
		ceph_put_cap_refs(ceph_inode(req->r_parent), CEPH_CAP_PIN);
		iput(req->r_parent);
	}
	iput(req->r_target_inode);
	/*
	 * ceph: preallocate inode for ops that may create one
	 *
	 * When creating a new inode, we need to determine the crypto context
	 * before we can transmit the RPC. The fscrypt API has a routine for getting
	 * a crypto context before a create occurs, but it requires an inode.
	 *
	 * Change the ceph code to preallocate an inode in advance of a create of
	 * any sort (open(), mknod(), symlink(), etc). Move the existing code that
	 * generates the ACL and SELinux blobs into this routine since that's
	 * mostly common across all the different codepaths.
	 *
	 * In most cases, we just want to allow ceph_fill_trace to use that inode
	 * after the reply comes in, so add a new field to the MDS request for it
	 * (r_new_inode).
	 *
	 * The async create codepath is a bit different though. In that case, we
	 * want to hash the inode in advance of the RPC so that it can be used
	 * before the reply comes in. If the call subsequently fails with
	 * -EJUKEBOX, then just put the references and clean up the as_ctx. Note
	 * that with this change, we now need to regenerate the as_ctx when this
	 * occurs, but it's quite rare for it to happen.
	 *
	 * Signed-off-by: Jeff Layton <jlayton@kernel.org>
	 * Reviewed-by: Xiubo Li <xiubli@redhat.com>
	 * Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
	 * Reviewed-by: Milind Changire <mchangir@redhat.com>
	 * Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
	 * 2020-08-26 13:11:00 -04:00
	 */
	iput(req->r_new_inode);
	if (req->r_dentry)
		dput(req->r_dentry);
	if (req->r_old_dentry)
		dput(req->r_old_dentry);
	if (req->r_old_dentry_dir) {
		/*
		 * track (and drop pins for) r_old_dentry_dir
		 * separately, since r_old_dentry's d_parent may have
		 * changed between the dir mutex being dropped and
		 * this request being freed.
		 */
		ceph_put_cap_refs(ceph_inode(req->r_old_dentry_dir),
				  CEPH_CAP_PIN);
		iput(req->r_old_dentry_dir);
	}
	kfree(req->r_path1);
	kfree(req->r_path2);
	put_cred(req->r_cred);
	if (req->r_mnt_idmap)
		mnt_idmap_put(req->r_mnt_idmap);
	if (req->r_pagelist)
		ceph_pagelist_release(req->r_pagelist);
	kfree(req->r_fscrypt_auth);
	kfree(req->r_altname);
	put_request_session(req);
	ceph_unreserve_caps(req->r_mdsc, &req->r_caps_reservation);
	WARN_ON_ONCE(!list_empty(&req->r_wait));
	kmem_cache_free(ceph_mds_request_cachep, req);
}

DEFINE_RB_FUNCS(request, struct ceph_mds_request, r_tid, r_node)

/*
 * lookup request, bump ref if found.
 *
 * called under mdsc->mutex.
 */
static struct ceph_mds_request *
lookup_get_request(struct ceph_mds_client *mdsc, u64 tid)
{
	struct ceph_mds_request *req;

	req = lookup_request(&mdsc->request_tree, tid);
	if (req)
		ceph_mdsc_get_request(req);

	return req;
}

/*
 * Register an in-flight request, and assign a tid.  Link to the
 * directory we are modifying (if any).
 *
 * Called under mdsc->mutex.
 */
static void __register_request(struct ceph_mds_client *mdsc,
			       struct ceph_mds_request *req,
			       struct inode *dir)
{
	struct ceph_client *cl = mdsc->fsc->client;
	int ret = 0;

	req->r_tid = ++mdsc->last_tid;
	if (req->r_num_caps) {
		ret = ceph_reserve_caps(mdsc, &req->r_caps_reservation,
					req->r_num_caps);
		if (ret < 0) {
			pr_err_client(cl, "%p failed to reserve caps: %d\n",
				      req, ret);
			/* set req->r_err to fail early from __do_request */
			req->r_err = ret;
			return;
		}
	}
	doutc(cl, "%p tid %lld\n", req, req->r_tid);
	ceph_mdsc_get_request(req);
	insert_request(&mdsc->request_tree, req);

	req->r_cred = get_current_cred();
	if (!req->r_mnt_idmap)
		req->r_mnt_idmap = &nop_mnt_idmap;

	if (mdsc->oldest_tid == 0 && req->r_op != CEPH_MDS_OP_SETFILELOCK)
		mdsc->oldest_tid = req->r_tid;

	if (dir) {
		struct ceph_inode_info *ci = ceph_inode(dir);

		ihold(dir);
		req->r_unsafe_dir = dir;
		spin_lock(&ci->i_unsafe_lock);
		list_add_tail(&req->r_unsafe_dir_item, &ci->i_unsafe_dirops);
		spin_unlock(&ci->i_unsafe_lock);
	}
}

static void __unregister_request(struct ceph_mds_client *mdsc,
				 struct ceph_mds_request *req)
{
	doutc(mdsc->fsc->client, "%p tid %lld\n", req, req->r_tid);

	/* Never leave an unregistered request on an unsafe list! */
	list_del_init(&req->r_unsafe_item);

	if (req->r_tid == mdsc->oldest_tid) {
		struct rb_node *p = rb_next(&req->r_node);
		mdsc->oldest_tid = 0;
		while (p) {
			struct ceph_mds_request *next_req =
				rb_entry(p, struct ceph_mds_request, r_node);
			if (next_req->r_op != CEPH_MDS_OP_SETFILELOCK) {
				mdsc->oldest_tid = next_req->r_tid;
				break;
			}
			p = rb_next(p);
		}
	}

	erase_request(&mdsc->request_tree, req);

	if (req->r_unsafe_dir) {
		struct ceph_inode_info *ci = ceph_inode(req->r_unsafe_dir);
		spin_lock(&ci->i_unsafe_lock);
		list_del_init(&req->r_unsafe_dir_item);
		spin_unlock(&ci->i_unsafe_lock);
	}
	if (req->r_target_inode &&
	    test_bit(CEPH_MDS_R_GOT_UNSAFE, &req->r_req_flags)) {
		struct ceph_inode_info *ci = ceph_inode(req->r_target_inode);
		spin_lock(&ci->i_unsafe_lock);
		list_del_init(&req->r_unsafe_target_item);
		spin_unlock(&ci->i_unsafe_lock);
	}

	if (req->r_unsafe_dir) {
		iput(req->r_unsafe_dir);
		req->r_unsafe_dir = NULL;
	}

	complete_all(&req->r_safe_completion);

	ceph_mdsc_put_request(req);
}
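
When the request holding `oldest_tid` is unregistered, the code above walks forward through the tid-ordered tree to the next request that is not a SETFILELOCK, or resets `oldest_tid` to 0 if none remains. The sketch below expresses the same rule over a plain array sorted by tid instead of an rb-tree; `struct req`, `OP_SETFILELOCK`, `OP_OTHER` and `next_oldest_tid` are hypothetical names for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical op codes; only "is it SETFILELOCK?" matters here. */
enum { OP_OTHER, OP_SETFILELOCK };

struct req {
	uint64_t tid;
	int op;
};

/*
 * reqs[] is sorted by ascending tid and reqs[idx] is about to be removed.
 * Returns the value oldest_tid should have afterwards: unchanged if the
 * removed request was not the oldest, otherwise the tid of the next
 * non-SETFILELOCK request, or 0 if there is none.
 */
static uint64_t next_oldest_tid(const struct req *reqs, size_t n,
				size_t idx, uint64_t oldest)
{
	size_t i;

	assert(reqs && idx < n);
	if (reqs[idx].tid != oldest)
		return oldest;
	for (i = idx + 1; i < n; i++) {
		if (reqs[i].op != OP_SETFILELOCK)
			return reqs[i].tid;
	}
	return 0;
}
```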

/*
 * ceph: clean up unsafe d_parent access in __choose_mds
 *
 * __choose_mds exists to pick an MDS to use when issuing a call. Doing
 * that typically involves picking an inode and using the authoritative
 * MDS for it. In most cases, that's pretty straightforward, as we are
 * using an inode to which we hold a reference (usually represented by
 * r_dentry or r_inode in the request).
 *
 * In the case of a snapshotted directory however, we need to fetch
 * the non-snapped parent, which involves walking back up the parents
 * in the tree. The dentries in the snapshot dir are effectively frozen
 * but the overall parent is _not_, and could vanish if a concurrent
 * rename were to occur.
 *
 * Clean this code up and take special care to ensure the validity of
 * the entries we're working with. First, try to use the inode in
 * r_locked_dir if one exists. If not and all we have is r_dentry,
 * then we have to walk back up the tree. Use the rcu_read_lock for
 * this so we can ensure that any d_parent we find won't go away, and
 * take extra care to deal with the possibility that the dentries could
 * go negative.
 *
 * Change get_nonsnap_parent to return an inode, and take a reference to
 * that inode before returning (if any). Change all of the other places
 * where we set "inode" in __choose_mds to also take a reference, and then
 * call iput on that inode before exiting the function.
 *
 * Link: http://tracker.ceph.com/issues/18148
 * Signed-off-by: Jeff Layton <jlayton@redhat.com>
 * Reviewed-by: Yan, Zheng <zyan@redhat.com>
 * Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
 * 2016-12-15 08:37:56 -05:00
 */

/*
 * Walk back up the dentry tree until we hit a dentry representing a
 * non-snapshot inode. We do this using the rcu_read_lock (which must be held
 * when calling this) to ensure that the objects won't disappear while we're
 * working with them. Once we hit a candidate dentry, we attempt to take a
 * reference to it, and return that as the result.
 */
static struct inode *get_nonsnap_parent(struct dentry *dentry)
{
	struct inode *inode = NULL;

	while (dentry && !IS_ROOT(dentry)) {
		inode = d_inode_rcu(dentry);
		if (!inode || ceph_snap(inode) == CEPH_NOSNAP)
			break;
		dentry = dentry->d_parent;
	}
	if (inode)
		inode = igrab(inode);
	return inode;
}

/*
 * Choose mds to send request to next.  If there is a hint set in the
 * request (e.g., due to a prior forward hint from the mds), use that.
 * Otherwise, consult frag tree and/or caps to identify the
 * appropriate mds.  If all else fails, choose randomly.
 *
 * Called under mdsc->mutex.
 */
static int __choose_mds(struct ceph_mds_client *mdsc,
			struct ceph_mds_request *req,
			bool *random)
{
	struct inode *inode;
	struct ceph_inode_info *ci;
	struct ceph_cap *cap;
	int mode = req->r_direct_mode;
	int mds = -1;
	u32 hash = req->r_direct_hash;
	bool is_hash = test_bit(CEPH_MDS_R_DIRECT_IS_HASH, &req->r_req_flags);
	struct ceph_client *cl = mdsc->fsc->client;

	if (random)
		*random = false;

	/*
	 * is there a specific mds we should try?  ignore hint if we have
	 * no session and the mds is not up (active or recovering).
	 */
	if (req->r_resend_mds >= 0 &&
	    (__have_session(mdsc, req->r_resend_mds) ||
	     ceph_mdsmap_get_state(mdsc->mdsmap, req->r_resend_mds) > 0)) {
		doutc(cl, "using resend_mds mds%d\n", req->r_resend_mds);
		return req->r_resend_mds;
	}

	if (mode == USE_RANDOM_MDS)
		goto random;

	inode = NULL;
	if (req->r_inode) {
		if (ceph_snap(req->r_inode) != CEPH_SNAPDIR) {
			inode = req->r_inode;
			ihold(inode);
		} else {
			/* req->r_dentry is non-null for LSSNAP request */
			rcu_read_lock();
			inode = get_nonsnap_parent(req->r_dentry);
			rcu_read_unlock();
			doutc(cl, "using snapdir's parent %p %llx.%llx\n",
			      inode, ceph_vinop(inode));
		}
	} else if (req->r_dentry) {
		/* ignore race with rename; old or new d_parent is okay */
struct dentry * parent ;
struct inode * dir ;
rcu_read_lock ( ) ;
2019-05-23 10:22:55 +08:00
parent = READ_ONCE ( req - > r_dentry - > d_parent ) ;
2017-01-31 10:28:26 -05:00
dir = req - > r_parent ? : d_inode_rcu ( parent ) ;
2010-08-16 09:21:27 -07:00
2016-12-15 08:37:56 -05:00
if ( ! dir | | dir - > i_sb ! = mdsc - > fsc - > sb ) {
/* not this fs or parent went negative */
2015-03-17 22:25:59 +00:00
inode = d_inode ( req - > r_dentry ) ;
2016-12-15 08:37:56 -05:00
if ( inode )
ihold ( inode ) ;
2010-08-16 09:21:27 -07:00
} else if ( ceph_snap ( dir ) ! = CEPH_NOSNAP ) {
/* direct snapped/virtual snapdir requests
* based on parent dir inode */
2016-12-15 08:37:56 -05:00
inode = get_nonsnap_parent ( parent ) ;
2023-06-12 09:04:07 +08:00
doutc ( cl , " using nonsnap parent %p %llx.%llx \n " ,
inode , ceph_vinop ( inode ) ) ;
2013-11-22 14:21:44 +08:00
} else {
2010-08-16 09:21:27 -07:00
/* dentry target */
2015-03-17 22:25:59 +00:00
inode = d_inode ( req - > r_dentry ) ;
2013-11-22 14:21:44 +08:00
if ( ! inode | | mode = = USE_AUTH_MDS ) {
/* dir + name */
2016-12-15 08:37:56 -05:00
inode = igrab ( dir ) ;
2013-11-22 14:21:44 +08:00
hash = ceph_dentry_hash ( dir , req - > r_dentry ) ;
is_hash = true ;
2016-12-15 08:37:56 -05:00
} else {
ihold ( inode ) ;
2013-11-22 14:21:44 +08:00
}
2009-10-06 11:31:09 -07:00
}
2016-12-15 08:37:56 -05:00
rcu_read_unlock ( ) ;
2009-10-06 11:31:09 -07:00
}
2010-08-16 09:21:27 -07:00
2009-10-06 11:31:09 -07:00
if ( ! inode )
goto random ;
2023-06-12 09:04:07 +08:00
doutc ( cl , " %p %llx.%llx is_hash=%d (0x%x) mode %d \n " , inode ,
ceph_vinop ( inode ) , ( int ) is_hash , hash , mode ) ;
2009-10-06 11:31:09 -07:00
	ci = ceph_inode(inode);
	if (is_hash && S_ISDIR(inode->i_mode)) {
		struct ceph_inode_frag frag;
		int found;
		ceph_choose_frag(ci, hash, &frag, &found);
		if (found) {
			if (mode == USE_ANY_MDS && frag.ndist > 0) {
				u8 r;
				/* choose a random replica */
				get_random_bytes(&r, 1);
				r %= frag.ndist;
				mds = frag.dist[r];
2023-06-12 09:04:07 +08:00
doutc ( cl , " %p %llx.%llx frag %u mds%d (%d/%d) \n " ,
inode , ceph_vinop ( inode ) , frag . frag ,
mds , ( int ) r , frag . ndist ) ;
2011-01-21 21:16:46 -08:00
if ( ceph_mdsmap_get_state ( mdsc - > mdsmap , mds ) > =
2019-11-26 07:24:22 -05:00
CEPH_MDS_STATE_ACTIVE & &
! ceph_mdsmap_is_laggy ( mdsc - > mdsmap , mds ) )
2016-12-15 08:37:56 -05:00
goto out ;
2009-10-06 11:31:09 -07:00
}
			/* since this file/dir wasn't known to be
			 * replicated, then we want to look for the
			 * authoritative mds. */
			if (frag.mds >= 0) {
				/* choose auth mds */
				mds = frag.mds;
2023-06-12 09:04:07 +08:00
doutc ( cl , " %p %llx.%llx frag %u mds%d (auth) \n " ,
inode , ceph_vinop ( inode ) , frag . frag , mds ) ;
2011-01-21 21:16:46 -08:00
if ( ceph_mdsmap_get_state ( mdsc - > mdsmap , mds ) > =
2019-11-26 07:24:22 -05:00
CEPH_MDS_STATE_ACTIVE ) {
2020-07-31 16:25:13 +08:00
if ( ! ceph_mdsmap_is_laggy ( mdsc - > mdsmap ,
2019-11-26 07:24:22 -05:00
mds ) )
goto out ;
}
2009-10-06 11:31:09 -07:00
}
2019-11-26 07:24:22 -05:00
mode = USE_AUTH_MDS ;
2009-10-06 11:31:09 -07:00
}
}
2011-11-30 09:47:09 -08:00
spin_lock ( & ci - > i_ceph_lock ) ;
2009-10-06 11:31:09 -07:00
	cap = NULL;
	if (mode == USE_AUTH_MDS)
		cap = ci->i_auth_cap;
	if (!cap && !RB_EMPTY_ROOT(&ci->i_caps))
		cap = rb_entry(rb_first(&ci->i_caps), struct ceph_cap, ci_node);
	if (!cap) {
2011-11-30 09:47:09 -08:00
spin_unlock ( & ci - > i_ceph_lock ) ;
2021-06-04 12:03:09 -04:00
iput ( inode ) ;
2009-10-06 11:31:09 -07:00
goto random ;
}
mds = cap - > session - > s_mds ;
2023-06-12 09:04:07 +08:00
doutc ( cl , " %p %llx.%llx mds%d (%scap %p) \n " , inode ,
ceph_vinop ( inode ) , mds ,
cap = = ci - > i_auth_cap ? " auth " : " " , cap ) ;
2011-11-30 09:47:09 -08:00
spin_unlock ( & ci - > i_ceph_lock ) ;
2016-12-15 08:37:56 -05:00
out :
2021-06-04 12:03:09 -04:00
iput ( inode ) ;
2009-10-06 11:31:09 -07:00
return mds ;
random :
2019-12-09 07:47:15 -05:00
if ( random )
* random = true ;
2009-10-06 11:31:09 -07:00
mds = ceph_mdsmap_get_random_mds ( mdsc - > mdsmap ) ;
2023-06-12 09:04:07 +08:00
doutc ( cl , " chose random mds%d \n " , mds ) ;
2009-10-06 11:31:09 -07:00
return mds ;
}
/*
* session messages
*/
2021-07-05 09:22:54 +08:00
struct ceph_msg * ceph_create_session_msg ( u32 op , u64 seq )
2009-10-06 11:31:09 -07:00
{
struct ceph_msg * msg ;
struct ceph_mds_session_head * h ;
2011-08-09 15:03:46 -07:00
msg = ceph_msg_new ( CEPH_MSG_CLIENT_SESSION , sizeof ( * h ) , GFP_NOFS ,
false ) ;
2010-04-01 16:06:19 -07:00
if ( ! msg ) {
2021-07-05 09:22:54 +08:00
pr_err ( " ENOMEM creating session %s msg \n " ,
ceph_session_op_name ( op ) ) ;
2010-04-01 16:06:19 -07:00
return NULL ;
2009-10-06 11:31:09 -07:00
}
h = msg - > front . iov_base ;
h - > op = cpu_to_le32 ( op ) ;
h - > seq = cpu_to_le64 ( seq ) ;
2014-09-09 19:26:01 +01:00
return msg ;
}
2020-01-08 05:17:31 -05:00
static const unsigned char feature_bits[] = CEPHFS_FEATURES_CLIENT_SUPPORTED;
#define FEATURE_BYTES(c) (DIV_ROUND_UP((size_t)feature_bits[c - 1] + 1, 64) * 8)
2020-06-30 03:52:18 -04:00
static int encode_supported_features ( void * * p , void * end )
2018-05-11 18:47:29 +08:00
{
2020-01-08 05:17:31 -05:00
static const size_t count = ARRAY_SIZE ( feature_bits ) ;
2018-05-11 18:47:29 +08:00
if ( count > 0 ) {
size_t i ;
2020-01-08 05:17:31 -05:00
size_t size = FEATURE_BYTES ( count ) ;
2022-05-24 17:06:27 +01:00
unsigned long bit ;
2018-05-11 18:47:29 +08:00
2020-06-30 03:52:18 -04:00
if ( WARN_ON_ONCE ( * p + 4 + size > end ) )
return - ERANGE ;
2018-05-11 18:47:29 +08:00
ceph_encode_32 ( p , size ) ;
memset ( * p , 0 , size ) ;
2022-05-24 17:06:27 +01:00
		for (i = 0; i < count; i++) {
			bit = feature_bits[i];
			((unsigned char *)(*p))[bit / 8] |= BIT(bit % 8);
		}
2018-05-11 18:47:29 +08:00
* p + = size ;
} else {
2020-06-30 03:52:18 -04:00
if ( WARN_ON_ONCE ( * p + 4 > end ) )
return - ERANGE ;
2018-05-11 18:47:29 +08:00
ceph_encode_32 ( p , 0 ) ;
}
2020-06-30 03:52:18 -04:00
return 0 ;
2018-05-11 18:47:29 +08:00
}
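
The sizing and bit-setting logic of encode_supported_features() can be tried outside the kernel. The following is a minimal user-space sketch, not the kernel implementation: the names bitmap_bytes() and encode_bitmap() are hypothetical, the length prefix is written by hand as little-endian (which is what ceph_encode_32() is assumed to produce), and the bit list is assumed to be sorted ascending so that its last entry is the highest bit, as FEATURE_BYTES() relies on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bytes needed to hold bits 0..max_bit, rounded up to whole 64-bit words
 * (mirrors DIV_ROUND_UP(max_bit + 1, 64) * 8). */
static size_t bitmap_bytes(unsigned max_bit)
{
	return ((size_t)max_bit + 1 + 63) / 64 * 8;
}

/*
 * Hypothetical helper: write a little-endian u32 length followed by a
 * bitmap with one bit set per entry of bits[] (sorted ascending).
 * Returns the number of bytes written, or 0 if the buffer is too small.
 */
static size_t encode_bitmap(unsigned char *buf, size_t buflen,
			    const unsigned char *bits, size_t count)
{
	size_t size = count ? bitmap_bytes(bits[count - 1]) : 0;
	size_t i;

	assert(buf != NULL);
	if (buflen < 4 + size)
		return 0;

	/* length prefix, little-endian */
	buf[0] = size & 0xff;
	buf[1] = (size >> 8) & 0xff;
	buf[2] = (size >> 16) & 0xff;
	buf[3] = (size >> 24) & 0xff;

	/* clear the bitmap, then set one bit per listed feature */
	memset(buf + 4, 0, size);
	for (i = 0; i < count; i++)
		buf[4 + bits[i] / 8] |= (unsigned char)(1u << (bits[i] % 8));

	return 4 + size;
}
```

With bits {0, 9, 65} the highest bit is 65, so 66 bits round up to two 64-bit words and the call writes 4 + 16 bytes. An empty list writes only the 4-byte zero length, matching the else branch above.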
2020-07-16 10:05:58 -04:00
static const unsigned char metric_bits[] = CEPHFS_METRIC_SPEC_CLIENT_SUPPORTED;
#define METRIC_BYTES(cnt) (DIV_ROUND_UP((size_t)metric_bits[cnt - 1] + 1, 64) * 8)
static int encode_metric_spec(void **p, void *end)
{
	static const size_t count = ARRAY_SIZE(metric_bits);
	/* header */
	if (WARN_ON_ONCE(*p + 2 > end))
		return -ERANGE;
	ceph_encode_8(p, 1); /* version */
	ceph_encode_8(p, 1); /* compat */
	if (count > 0) {
		size_t i;
		size_t size = METRIC_BYTES(count);
		if (WARN_ON_ONCE(*p + 4 + 4 + size > end))
			return -ERANGE;
		/* metric spec info length */
		ceph_encode_32(p, 4 + size);
		/* metric spec */
		ceph_encode_32(p, size);
		memset(*p, 0, size);
		for (i = 0; i < count; i++)
			((unsigned char *)(*p))[i / 8] |= BIT(metric_bits[i] % 8);
		*p += size;
	} else {
		if (WARN_ON_ONCE(*p + 4 + 4 > end))
			return -ERANGE;
		/* metric spec info length */
		ceph_encode_32(p, 4);
		/* metric spec */
		ceph_encode_32(p, 0);
	}
	return 0;
}
2014-09-09 19:26:01 +01:00
/*
 * session message, specialization for CEPH_SESSION_REQUEST_OPEN
 * to include additional client metadata fields.
 */
static struct ceph_msg *create_session_open_msg(struct ceph_mds_client *mdsc, u64 seq)
{
	struct ceph_msg *msg;
	struct ceph_mds_session_head *h;
2020-12-04 18:54:21 +00:00
int i ;
2018-05-11 18:47:29 +08:00
int extra_bytes = 0 ;
2014-09-09 19:26:01 +01:00
int metadata_key_count = 0 ;
struct ceph_options * opt = mdsc - > fsc - > client - > options ;
2016-04-21 11:09:55 +08:00
struct ceph_mount_options * fsopt = mdsc - > fsc - > mount_options ;
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = mdsc - > fsc - > client ;
2020-01-08 05:17:31 -05:00
size_t size , count ;
2018-05-11 18:47:29 +08:00
void * p , * end ;
2020-06-30 03:52:18 -04:00
int ret ;
2014-09-09 19:26:01 +01:00
2015-01-16 10:54:43 +08:00
const char * metadata [ ] [ 2 ] = {
2017-09-11 12:10:08 +08:00
{ " hostname " , mdsc - > nodename } ,
{ " kernel_version " , init_utsname ( ) - > release } ,
2016-04-21 11:09:55 +08:00
{ " entity_id " , opt - > name ? : " " } ,
{ " root " , fsopt - > server_path ? : " / " } ,
2014-09-09 19:26:01 +01:00
{ NULL , NULL }
} ;
/* Calculate serialized length of metadata */
2018-05-11 18:47:29 +08:00
extra_bytes = 4 ; /* map length */
2017-08-20 20:22:02 +02:00
for ( i = 0 ; metadata [ i ] [ 0 ] ; + + i ) {
2018-05-11 18:47:29 +08:00
extra_bytes + = 8 + strlen ( metadata [ i ] [ 0 ] ) +
2014-09-09 19:26:01 +01:00
strlen ( metadata [ i ] [ 1 ] ) ;
metadata_key_count + + ;
}
2020-01-08 05:17:31 -05:00
2018-05-11 18:47:29 +08:00
/* supported feature */
2020-01-08 05:17:31 -05:00
size = 0 ;
count = ARRAY_SIZE ( feature_bits ) ;
if ( count > 0 )
size = FEATURE_BYTES ( count ) ;
extra_bytes + = 4 + size ;
2014-09-09 19:26:01 +01:00
2020-07-16 10:05:58 -04:00
/* metric spec */
size = 0 ;
count = ARRAY_SIZE ( metric_bits ) ;
if ( count > 0 )
size = METRIC_BYTES ( count ) ;
extra_bytes + = 2 + 4 + 4 + size ;
2014-09-09 19:26:01 +01:00
/* Allocate the message */
2018-05-11 18:47:29 +08:00
msg = ceph_msg_new ( CEPH_MSG_CLIENT_SESSION , sizeof ( * h ) + extra_bytes ,
2014-09-09 19:26:01 +01:00
GFP_NOFS , false ) ;
if ( ! msg ) {
2023-06-12 09:04:07 +08:00
pr_err_client ( cl , " ENOMEM creating session open msg \n " ) ;
2020-06-30 03:52:18 -04:00
return ERR_PTR ( - ENOMEM ) ;
2014-09-09 19:26:01 +01:00
}
2018-05-11 18:47:29 +08:00
p = msg - > front . iov_base ;
end = p + msg - > front . iov_len ;
h = p ;
2014-09-09 19:26:01 +01:00
h - > op = cpu_to_le32 ( CEPH_SESSION_REQUEST_OPEN ) ;
h - > seq = cpu_to_le64 ( seq ) ;
	/*
	 * Serialize client metadata into waiting buffer space, using
	 * the format that userspace expects for map<string,string>
2014-10-30 17:15:26 +00:00
*
2020-07-16 10:05:58 -04:00
* ClientSession messages with metadata are v4
2014-09-09 19:26:01 +01:00
*/
2020-07-16 10:05:58 -04:00
msg - > hdr . version = cpu_to_le16 ( 4 ) ;
2014-10-30 17:15:26 +00:00
msg - > hdr . compat_version = cpu_to_le16 ( 1 ) ;
2014-09-09 19:26:01 +01:00
/* The write pointer, following the session_head structure */
2018-05-11 18:47:29 +08:00
p + = sizeof ( * h ) ;
2014-09-09 19:26:01 +01:00
/* Number of entries in the map */
ceph_encode_32 ( & p , metadata_key_count ) ;
/* Two length-prefixed strings for each entry in the map */
2017-08-20 20:22:02 +02:00
for ( i = 0 ; metadata [ i ] [ 0 ] ; + + i ) {
2014-09-09 19:26:01 +01:00
		size_t const key_len = strlen(metadata[i][0]);
		size_t const val_len = strlen(metadata[i][1]);
		ceph_encode_32(&p, key_len);
		memcpy(p, metadata[i][0], key_len);
		p += key_len;
		ceph_encode_32(&p, val_len);
		memcpy(p, metadata[i][1], val_len);
		p += val_len;
	}
2020-06-30 03:52:18 -04:00
ret = encode_supported_features ( & p , end ) ;
if ( ret ) {
2023-06-12 09:04:07 +08:00
pr_err_client ( cl , " encode_supported_features failed! \n " ) ;
2020-06-30 03:52:18 -04:00
ceph_msg_put ( msg ) ;
return ERR_PTR ( ret ) ;
}
2020-07-16 10:05:58 -04:00
ret = encode_metric_spec ( & p , end ) ;
if ( ret ) {
2023-06-12 09:04:07 +08:00
pr_err_client ( cl , " encode_metric_spec failed! \n " ) ;
2020-07-16 10:05:58 -04:00
ceph_msg_put ( msg ) ;
return ERR_PTR ( ret ) ;
}
2018-05-11 18:47:29 +08:00
msg - > front . iov_len = p - msg - > front . iov_base ;
msg - > hdr . front_len = cpu_to_le32 ( msg - > front . iov_len ) ;
2009-10-06 11:31:09 -07:00
return msg ;
}
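
The metadata layout built by create_session_open_msg() (a 32-bit entry count, then a length-prefixed key and a length-prefixed value per entry) can be illustrated with a small user-space sketch. This is an illustration under stated assumptions, not the kernel code: put_le32(), kv_map_len() and kv_map_encode() are hypothetical names, and the little-endian byte order is assumed to match what ceph_encode_32() emits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write a 32-bit little-endian value and advance the cursor. */
static void put_le32(unsigned char **p, uint32_t v)
{
	(*p)[0] = v & 0xff;
	(*p)[1] = (v >> 8) & 0xff;
	(*p)[2] = (v >> 16) & 0xff;
	(*p)[3] = (v >> 24) & 0xff;
	*p += 4;
}

/* Serialized size: 4 bytes for the entry count, then 8 bytes of length
 * prefixes plus the key and value bytes for each entry. */
static size_t kv_map_len(const char *kv[][2])
{
	size_t len = 4, i;

	for (i = 0; kv[i][0]; i++)
		len += 8 + strlen(kv[i][0]) + strlen(kv[i][1]);
	return len;
}

/*
 * Encode a NULL-terminated key/value table as map<string,string>.
 * Returns the number of bytes written, or 0 if the buffer is too small.
 */
static size_t kv_map_encode(unsigned char *buf, size_t buflen,
			    const char *kv[][2])
{
	unsigned char *p = buf;
	size_t n, i;
	int j;

	assert(buf != NULL);
	if (buflen < kv_map_len(kv))
		return 0;

	/* count the entries up to the NULL key sentinel */
	for (n = 0; kv[n][0]; n++)
		;
	put_le32(&p, (uint32_t)n);

	/* two length-prefixed strings (key, then value) per entry */
	for (i = 0; i < n; i++) {
		for (j = 0; j < 2; j++) {
			size_t len = strlen(kv[i][j]);

			put_le32(&p, (uint32_t)len);
			memcpy(p, kv[i][j], len);
			p += len;
		}
	}
	return (size_t)(p - buf);
}
```

Computing the length first and encoding second follows the same two-pass shape as the function above, where extra_bytes is summed before ceph_msg_new() allocates the front buffer.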
/*
 * send session open request.
 *
 * called under mdsc->mutex
 */
static int __open_session(struct ceph_mds_client *mdsc,
			  struct ceph_mds_session *session)
{
	struct ceph_msg *msg;
	int mstate;
	int mds = session->s_mds;
2023-02-01 09:36:45 +08:00
if ( READ_ONCE ( mdsc - > fsc - > mount_state ) = = CEPH_MOUNT_FENCE_IO )
return - EIO ;
2009-10-06 11:31:09 -07:00
/* wait for mds to go active? */
mstate = ceph_mdsmap_get_state ( mdsc - > mdsmap , mds ) ;
2023-06-12 09:04:07 +08:00
doutc ( mdsc - > fsc - > client , " open_session to mds%d (%s) \n " , mds ,
ceph_mds_state_name ( mstate ) ) ;
2009-10-06 11:31:09 -07:00
session - > s_state = CEPH_MDS_SESSION_OPENING ;
session - > s_renew_requested = jiffies ;
/* send connect message */
2014-09-09 19:26:01 +01:00
msg = create_session_open_msg ( mdsc , session - > s_seq ) ;
2020-06-30 03:52:18 -04:00
if ( IS_ERR ( msg ) )
return PTR_ERR ( msg ) ;
2009-10-06 11:31:09 -07:00
ceph_con_send ( & session - > s_con , msg ) ;
return 0 ;
}
2010-06-21 13:38:25 -07:00
/*
 * open sessions for any export targets for the given mds
 *
 * called under mdsc->mutex
 */
2013-11-24 14:33:01 +08:00
static struct ceph_mds_session *
__open_export_target_session ( struct ceph_mds_client * mdsc , int target )
{
struct ceph_mds_session * session ;
2020-06-30 03:52:18 -04:00
int ret ;
2013-11-24 14:33:01 +08:00
session = __ceph_lookup_mds_session ( mdsc , target ) ;
if ( ! session ) {
session = register_session ( mdsc , target ) ;
if ( IS_ERR ( session ) )
return session ;
}
if ( session - > s_state = = CEPH_MDS_SESSION_NEW | |
2020-06-30 03:52:18 -04:00
session - > s_state = = CEPH_MDS_SESSION_CLOSING ) {
ret = __open_session ( mdsc , session ) ;
if ( ret )
return ERR_PTR ( ret ) ;
}
2013-11-24 14:33:01 +08:00
return session ;
}
struct ceph_mds_session *
ceph_mdsc_open_export_target_session ( struct ceph_mds_client * mdsc , int target )
{
struct ceph_mds_session * session ;
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = mdsc - > fsc - > client ;
2013-11-24 14:33:01 +08:00
2023-06-12 09:04:07 +08:00
doutc ( cl , " to mds%d \n " , target ) ;
2013-11-24 14:33:01 +08:00
mutex_lock ( & mdsc - > mutex ) ;
session = __open_export_target_session ( mdsc , target ) ;
mutex_unlock ( & mdsc - > mutex ) ;
return session ;
}
2010-06-21 13:38:25 -07:00
static void __open_export_target_sessions ( struct ceph_mds_client * mdsc ,
struct ceph_mds_session * session )
{
struct ceph_mds_info * mi ;
struct ceph_mds_session * ts ;
int i , mds = session - > s_mds ;
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = mdsc - > fsc - > client ;
2010-06-21 13:38:25 -07:00
2019-12-04 06:57:39 -05:00
if ( mds > = mdsc - > mdsmap - > possible_max_rank )
2010-06-21 13:38:25 -07:00
return ;
2013-11-24 14:33:01 +08:00
2010-06-21 13:38:25 -07:00
mi = & mdsc - > mdsmap - > m_info [ mds ] ;
2023-06-12 09:04:07 +08:00
doutc ( cl , " for mds%d (%d targets) \n " , session - > s_mds ,
mi - > num_export_targets ) ;
2010-06-21 13:38:25 -07:00
for ( i = 0 ; i < mi - > num_export_targets ; i + + ) {
2013-11-24 14:33:01 +08:00
ts = __open_export_target_session ( mdsc , mi - > export_targets [ i ] ) ;
2021-06-09 14:09:52 -04:00
ceph_put_mds_session ( ts ) ;
2010-06-21 13:38:25 -07:00
}
}
2010-06-21 13:45:04 -07:00
void ceph_mdsc_open_export_target_sessions ( struct ceph_mds_client * mdsc ,
struct ceph_mds_session * session )
{
mutex_lock ( & mdsc - > mutex ) ;
__open_export_target_sessions ( mdsc , session ) ;
mutex_unlock ( & mdsc - > mutex ) ;
}
2009-10-06 11:31:09 -07:00
/*
* session caps
*/
2017-10-19 08:53:58 -04:00
static void detach_cap_releases ( struct ceph_mds_session * session ,
struct list_head * target )
2009-10-06 11:31:09 -07:00
{
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = session - > s_mdsc - > fsc - > client ;
2017-10-19 08:53:58 -04:00
lockdep_assert_held ( & session - > s_cap_lock ) ;
list_splice_init ( & session - > s_cap_releases , target ) ;
2015-05-14 17:22:42 +08:00
session - > s_num_cap_releases = 0 ;
2023-06-12 09:04:07 +08:00
doutc ( cl , " mds%d \n " , session - > s_mds ) ;
2017-10-19 08:53:58 -04:00
}
2009-10-06 11:31:09 -07:00
2017-10-19 08:53:58 -04:00
static void dispose_cap_releases ( struct ceph_mds_client * mdsc ,
struct list_head * dispose )
{
while ( ! list_empty ( dispose ) ) {
2015-05-14 17:22:42 +08:00
struct ceph_cap * cap ;
/* zero out the in-progress message */
2017-10-19 08:53:58 -04:00
cap = list_first_entry ( dispose , struct ceph_cap , session_caps ) ;
2015-05-14 17:22:42 +08:00
list_del ( & cap - > session_caps ) ;
ceph_put_cap ( mdsc , cap ) ;
2009-10-06 11:31:09 -07:00
}
}
2015-03-24 20:15:36 +08:00
static void cleanup_session_requests ( struct ceph_mds_client * mdsc ,
struct ceph_mds_session * session )
{
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = mdsc - > fsc - > client ;
2015-03-24 20:15:36 +08:00
struct ceph_mds_request * req ;
struct rb_node * p ;
2023-06-12 09:04:07 +08:00
doutc ( cl , " mds%d \n " , session - > s_mds ) ;
2015-03-24 20:15:36 +08:00
mutex_lock ( & mdsc - > mutex ) ;
while ( ! list_empty ( & session - > s_unsafe ) ) {
req = list_first_entry ( & session - > s_unsafe ,
struct ceph_mds_request , r_unsafe_item ) ;
2023-06-12 09:04:07 +08:00
pr_warn_ratelimited_client ( cl , " dropping unsafe request %llu \n " ,
req - > r_tid ) ;
2021-10-07 14:19:49 -04:00
if ( req - > r_target_inode )
mapping_set_error ( req - > r_target_inode - > i_mapping , - EIO ) ;
if ( req - > r_unsafe_dir )
mapping_set_error ( req - > r_unsafe_dir - > i_mapping , - EIO ) ;
2015-03-24 20:15:36 +08:00
__unregister_request ( mdsc , req ) ;
}
	/* zero r_attempts, so kick_requests() will re-send requests */
	p = rb_first(&mdsc->request_tree);
	while (p) {
		req = rb_entry(p, struct ceph_mds_request, r_node);
		p = rb_next(p);
		if (req->r_session &&
		    req->r_session->s_mds == session->s_mds)
			req->r_attempts = 0;
	}
	mutex_unlock(&mdsc->mutex);
}
2009-10-06 11:31:09 -07:00
/*
2010-05-11 20:56:31 -07:00
* Helper to safely iterate over all caps associated with a session , with
* special care taken to handle a racing __ceph_remove_cap ( ) .
2009-10-06 11:31:09 -07:00
*
2010-05-11 20:56:31 -07:00
* Caller must hold session s_mutex .
2009-10-06 11:31:09 -07:00
*/
2019-04-24 12:09:04 -04:00
int ceph_iterate_session_caps ( struct ceph_mds_session * session ,
2023-04-19 10:39:14 +08:00
int ( * cb ) ( struct inode * , int mds , void * ) ,
void * arg )
2009-10-06 11:31:09 -07:00
{
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = session - > s_mdsc - > fsc - > client ;
ceph: fix iterate_caps removal race
We need to be able to iterate over all caps on a session with a
possibly slow callback on each cap. To allow this, we used to
prevent cap reordering while we were iterating. However, we were
not safe from races with removal: removing the 'next' cap would
make the next pointer from list_for_each_entry_safe be invalid,
and cause a lock up or similar badness.
Instead, we keep an iterator pointer in the session pointing to
the current cap. As before, we avoid reordering. For removal,
if the cap isn't the current cap we are iterating over, we are
fine. If it is, we clear cap->ci (to mark the cap as pending
removal) but leave it in the session list. In iterate_caps, we
can safely finish removal and get the next cap pointer.
While we're at it, clean up put_cap to not take a cap reservation
context, as it was never used.
Signed-off-by: Sage Weil <sage@newdream.net>
2010-02-16 11:39:45 -08:00
struct list_head * p ;
struct ceph_cap * cap ;
struct inode * inode , * last_inode = NULL ;
struct ceph_cap * old_cap = NULL ;
2009-10-06 11:31:09 -07:00
int ret ;
2023-06-12 09:04:07 +08:00
doutc ( cl , " %p mds%d \n " , session , session - > s_mds ) ;
2009-10-06 11:31:09 -07:00
spin_lock ( & session - > s_cap_lock ) ;
2010-02-16 11:39:45 -08:00
p = session - > s_caps . next ;
while ( p ! = & session - > s_caps ) {
2023-04-19 10:39:14 +08:00
int mds ;
2010-02-16 11:39:45 -08:00
cap = list_entry ( p , struct ceph_cap , session_caps ) ;
netfs: Fix gcc-12 warning by embedding vfs inode in netfs_i_context
While randstruct was satisfied with using an open-coded "void *" offset
cast for the netfs_i_context <-> inode casting, __builtin_object_size() as
used by FORTIFY_SOURCE was not as easily fooled. This was causing the
following complaint[1] from gcc v12:
In file included from include/linux/string.h:253,
from include/linux/ceph/ceph_debug.h:7,
from fs/ceph/inode.c:2:
In function 'fortify_memset_chk',
inlined from 'netfs_i_context_init' at include/linux/netfs.h:326:2,
inlined from 'ceph_alloc_inode' at fs/ceph/inode.c:463:2:
include/linux/fortify-string.h:242:25: warning: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
242 | __write_overflow_field(p_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fix this by embedding a struct inode into struct netfs_i_context (which
should perhaps be renamed to struct netfs_inode). The struct inode
vfs_inode fields are then removed from the 9p, afs, ceph and cifs inode
structs and vfs_inode is then simply changed to "netfs.inode" in those
filesystems.
Further, rename netfs_i_context to netfs_inode, get rid of the
netfs_inode() function that converted a netfs_i_context pointer to an
inode pointer (that can now be done with &ctx->inode) and rename the
netfs_i_context() function to netfs_inode() (which is now a wrapper
around container_of()).
Most of the changes were done with:
perl -p -i -e 's/vfs_inode/netfs.inode/'g \
`git grep -l 'vfs_inode' -- fs/{9p,afs,ceph,cifs}/*.[ch]`
Kees suggested doing it with a pair structure[2] and a special
declarator to insert that into the network filesystem's inode
wrapper[3], but I think it's cleaner to embed it - and then it doesn't
matter if struct randomisation reorders things.
Dave Chinner suggested using a filesystem-specific VFS_I() function in
each filesystem to convert that filesystem's own inode wrapper struct
into the VFS inode struct[4].
Version #2:
- Fix a couple of missed name changes due to a disabled cifs option.
- Rename nfs_i_context to nfs_inode
- Use "netfs" instead of "nic" as the member name in per-fs inode wrapper
structs.
[ This also undoes commit 507160f46c55 ("netfs: gcc-12: temporarily
disable '-Wattribute-warning' for now") that is no longer needed ]
Fixes: bc899ee1c898 ("netfs: Add a netfs inode context")
Reported-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
cc: Jonathan Corbet <corbet@lwn.net>
cc: Eric Van Hensbergen <ericvh@gmail.com>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Steve French <smfrench@gmail.com>
cc: William Kucharski <william.kucharski@oracle.com>
cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
cc: Dave Chinner <david@fromorbit.com>
cc: linux-doc@vger.kernel.org
cc: v9fs-developer@lists.sourceforge.net
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: samba-technical@lists.samba.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-hardening@vger.kernel.org
Link: https://lore.kernel.org/r/d2ad3a3d7bdd794c6efb562d2f2b655fb67756b9.camel@kernel.org/ [1]
Link: https://lore.kernel.org/r/20220517210230.864239-1-keescook@chromium.org/ [2]
Link: https://lore.kernel.org/r/20220518202212.2322058-1-keescook@chromium.org/ [3]
Link: https://lore.kernel.org/r/20220524101205.GI2306852@dread.disaster.area/ [4]
Link: https://lore.kernel.org/r/165296786831.3591209.12111293034669289733.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165305805651.4094995.7763502506786714216.stgit@warthog.procyon.org.uk # v2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-06-09 21:46:04 +01:00
inode = igrab ( & cap - > ci - > netfs . inode ) ;
2010-02-16 11:39:45 -08:00
if ( ! inode ) {
p = p - > next ;
2009-10-06 11:31:09 -07:00
continue ;
2010-02-16 11:39:45 -08:00
}
session - > s_cap_iterator = cap ;
2023-04-19 10:39:14 +08:00
mds = cap - > mds ;
2009-10-06 11:31:09 -07:00
spin_unlock ( & session - > s_cap_lock ) ;
2010-02-16 11:39:45 -08:00
if ( last_inode ) {
2021-06-04 12:03:09 -04:00
iput ( last_inode ) ;
2010-02-16 11:39:45 -08:00
last_inode = NULL ;
}
if ( old_cap ) {
2010-06-17 16:16:12 -07:00
ceph_put_cap ( session - > s_mdsc , old_cap ) ;
2010-02-16 11:39:45 -08:00
old_cap = NULL ;
}
2023-04-19 10:39:14 +08:00
ret = cb ( inode , mds , arg ) ;
2010-02-16 11:39:45 -08:00
last_inode = inode ;
2009-10-06 11:31:09 -07:00
spin_lock ( & session - > s_cap_lock ) ;
2010-02-16 11:39:45 -08:00
p = p - > next ;
2017-08-20 20:22:02 +02:00
if ( ! cap - > ci ) {
2023-06-12 09:04:07 +08:00
doutc ( cl , " finishing cap %p removal \n " , cap ) ;
2010-02-16 11:39:45 -08:00
BUG_ON ( cap - > session ! = session ) ;
2015-05-14 17:22:42 +08:00
cap - > session = NULL ;
2010-02-16 11:39:45 -08:00
list_del_init ( & cap - > session_caps ) ;
session - > s_nr_caps - - ;
2020-06-30 03:52:16 -04:00
atomic64_dec ( & session - > s_mdsc - > metric . total_caps ) ;
2019-01-14 17:21:19 +08:00
if ( cap - > queue_release )
__ceph_queue_cap_release ( session , cap ) ;
else
2015-05-14 17:22:42 +08:00
old_cap = cap ; /* put_cap it w/o locks held */
2010-02-16 11:39:45 -08:00
}
2009-12-21 20:40:34 -08:00
if ( ret < 0 )
goto out ;
2009-10-06 11:31:09 -07:00
}
2009-12-21 20:40:34 -08:00
ret = 0 ;
out :
2010-02-16 11:39:45 -08:00
session - > s_cap_iterator = NULL ;
2009-10-06 11:31:09 -07:00
spin_unlock ( & session - > s_cap_lock ) ;
2010-02-16 11:39:45 -08:00
2021-06-04 12:03:09 -04:00
iput ( last_inode ) ;
2010-02-16 11:39:45 -08:00
if ( old_cap )
2010-06-17 16:16:12 -07:00
ceph_put_cap ( session - > s_mdsc , old_cap ) ;
2010-02-16 11:39:45 -08:00
2009-12-21 20:40:34 -08:00
return ret ;
2009-10-06 11:31:09 -07:00
}
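The iterator-pointer scheme described in the "fix iterate_caps removal race" commit message can be sketched in user space as follows. This is a minimal single-threaded illustration, not the kernel code: `struct cap`, `struct session`, `session_remove()`, `session_iterate()` and the sample callback are hypothetical stand-ins, the `owner` field plays the role of `cap->ci`, and the locking and inode reference handling of the real function are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct ceph_cap / struct ceph_mds_session. */
struct cap {
	struct cap *prev, *next;  /* circular doubly linked session list */
	void *owner;              /* NULL marks "removal pending" (like cap->ci) */
	int id;
};

struct session {
	struct cap head;          /* list sentinel */
	struct cap *iterator;     /* cap currently being visited, or NULL */
	int nr_caps;
};

static void session_init(struct session *s)
{
	s->head.prev = s->head.next = &s->head;
	s->head.owner = NULL;
	s->iterator = NULL;
	s->nr_caps = 0;
}

static void session_add(struct session *s, struct cap *c, void *owner, int id)
{
	c->owner = owner;
	c->id = id;
	c->prev = s->head.prev;
	c->next = &s->head;
	s->head.prev->next = c;
	s->head.prev = c;
	s->nr_caps++;
}

static void cap_unlink(struct session *s, struct cap *c)
{
	c->prev->next = c->next;
	c->next->prev = c->prev;
	c->prev = c->next = c;
	s->nr_caps--;
}

/* Remove a cap; defer the unlink if the iterator is parked on it. */
static void session_remove(struct session *s, struct cap *c)
{
	c->owner = NULL;
	if (s->iterator != c)
		cap_unlink(s, c);
}

typedef int (*cap_cb)(struct session *s, struct cap *c, void *arg);

static int session_iterate(struct session *s, cap_cb cb, void *arg)
{
	struct cap *p = s->head.next;
	int ret = 0;

	while (p != &s->head) {
		struct cap *c = p;

		s->iterator = c;
		ret = cb(s, c, arg);   /* may call session_remove() on any cap */
		p = p->next;           /* read only after the callback returns */
		if (!c->owner)
			cap_unlink(s, c);  /* finish the deferred removal */
		if (ret < 0)
			break;
	}
	s->iterator = NULL;
	return ret;
}

/* Sample callback: drops the visited cap and its successor, if any. */
static int drop_current_and_next(struct session *s, struct cap *c, void *arg)
{
	int *visited = arg;

	(*visited)++;
	if (c->next != &s->head)
		session_remove(s, c->next);
	session_remove(s, c);
	return 0;
}
```

Because the next pointer is read only after the callback returns, removing the successor is harmless, and removing the current cap is merely deferred until the iterator has moved past it.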
2023-04-19 10:39:14 +08:00
static int remove_session_caps_cb ( struct inode * inode , int mds , void * arg )
2009-10-06 11:31:09 -07:00
{
struct ceph_inode_info * ci = ceph_inode ( inode ) ;
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = ceph_inode_to_client ( inode ) ;
2016-04-15 13:56:12 +08:00
bool invalidate = false ;
2023-04-19 10:39:14 +08:00
struct ceph_cap * cap ;
int iputs = 0 ;
2010-05-10 16:12:25 -07:00
2011-11-30 09:47:09 -08:00
spin_lock ( & ci - > i_ceph_lock ) ;
2023-04-19 10:39:14 +08:00
cap = __get_cap_for_mds ( ci , mds ) ;
if ( cap ) {
2023-06-12 09:04:07 +08:00
doutc ( cl , " removing cap %p, ci is %p, inode is %p \n " ,
cap , ci , & ci - > netfs . inode ) ;
2023-04-19 10:39:14 +08:00
iputs = ceph_purge_inode_cap ( inode , cap , & invalidate ) ;
}
2011-11-30 09:47:09 -08:00
spin_unlock ( & ci - > i_ceph_lock ) ;
2016-04-08 15:27:16 +08:00
2023-04-19 10:39:14 +08:00
if ( cap )
wake_up_all ( & ci - > i_cap_wq ) ;
2016-04-15 13:56:12 +08:00
if ( invalidate )
ceph_queue_invalidate ( inode ) ;
2021-09-02 13:06:57 -04:00
while ( iputs - - )
2021-08-25 21:45:43 +08:00
iput ( inode ) ;
2009-10-06 11:31:09 -07:00
return 0 ;
}
/*
* caller must hold session s_mutex
*/
static void remove_session_caps ( struct ceph_mds_session * session )
{
2016-04-15 13:56:12 +08:00
struct ceph_fs_client * fsc = session - > s_mdsc - > fsc ;
struct super_block * sb = fsc - > sb ;
2017-10-19 08:53:58 -04:00
LIST_HEAD ( dispose ) ;
2023-06-12 09:04:07 +08:00
doutc ( fsc - > client , " on %p \n " , session ) ;
2019-04-24 12:09:04 -04:00
ceph_iterate_session_caps ( session , remove_session_caps_cb , fsc ) ;
2013-07-24 12:22:11 +08:00
2016-07-07 15:22:38 +08:00
wake_up_all ( & fsc - > mdsc - > cap_flushing_wq ) ;
2013-07-24 12:22:11 +08:00
spin_lock ( & session - > s_cap_lock ) ;
if ( session - > s_nr_caps > 0 ) {
struct inode * inode ;
struct ceph_cap * cap , * prev = NULL ;
struct ceph_vino vino ;
/*
* iterate_session_caps ( ) skips inodes that are being
* deleted , so we need to wait until the deletions are complete .
* __wait_on_freeing_inode ( ) is designed for the job ,
* but it is not exported , so use lookup inode function
* to access it .
*/
while ( ! list_empty ( & session - > s_caps ) ) {
cap = list_entry ( session - > s_caps . next ,
struct ceph_cap , session_caps ) ;
if ( cap = = prev )
break ;
prev = cap ;
vino = cap - > ci - > i_vino ;
spin_unlock ( & session - > s_cap_lock ) ;
2013-09-02 15:19:53 +08:00
inode = ceph_find_inode ( sb , vino ) ;
2021-06-04 12:03:09 -04:00
iput ( inode ) ;
2013-07-24 12:22:11 +08:00
spin_lock ( & session - > s_cap_lock ) ;
}
}
2015-05-14 17:22:42 +08:00
// detach cap releases ; s_cap_lock is dropped after the checks below
2017-10-19 08:53:58 -04:00
detach_cap_releases ( session , & dispose ) ;
2013-07-24 12:22:11 +08:00
2009-10-06 11:31:09 -07:00
BUG_ON ( session - > s_nr_caps > 0 ) ;
2010-05-10 16:12:25 -07:00
BUG_ON ( ! list_empty ( & session - > s_cap_flushing ) ) ;
2017-10-19 08:53:58 -04:00
spin_unlock ( & session - > s_cap_lock ) ;
dispose_cap_releases ( session - > s_mdsc , & dispose ) ;
2009-10-06 11:31:09 -07:00
}
2018-12-10 16:35:09 +08:00
enum {
RECONNECT ,
RENEWCAPS ,
FORCE_RO ,
} ;
2009-10-06 11:31:09 -07:00
/*
* wake up any threads waiting on this session ' s caps . if the cap is
* old ( didn ' t get renewed on the client reconnect ) , remove it now .
*
* caller must hold s_mutex .
*/
2023-04-19 10:39:14 +08:00
static int wake_up_session_cb ( struct inode * inode , int mds , void * arg )
2009-10-06 11:31:09 -07:00
{
2009-11-20 13:43:45 -08:00
struct ceph_inode_info * ci = ceph_inode ( inode ) ;
2018-12-10 16:35:09 +08:00
unsigned long ev = ( unsigned long ) arg ;
2009-11-20 13:43:45 -08:00
2018-12-10 16:35:09 +08:00
if ( ev = = RECONNECT ) {
2011-11-30 09:47:09 -08:00
spin_lock ( & ci - > i_ceph_lock ) ;
2009-11-20 13:43:45 -08:00
ci - > i_wanted_max_size = 0 ;
ci - > i_requested_max_size = 0 ;
2011-11-30 09:47:09 -08:00
spin_unlock ( & ci - > i_ceph_lock ) ;
2018-12-10 16:35:09 +08:00
} else if ( ev = = RENEWCAPS ) {
2023-04-19 10:39:14 +08:00
struct ceph_cap * cap ;
spin_lock ( & ci - > i_ceph_lock ) ;
cap = __get_cap_for_mds ( ci , mds ) ;
/* mds did not re-issue stale cap */
if ( cap & & cap - > cap_gen < atomic_read ( & cap - > session - > s_cap_gen ) )
2018-12-10 16:35:09 +08:00
cap - > issued = cap - > implemented = CEPH_CAP_PIN ;
2023-04-19 10:39:14 +08:00
spin_unlock ( & ci - > i_ceph_lock ) ;
2018-12-10 16:35:09 +08:00
} else if ( ev = = FORCE_RO ) {
2009-11-20 13:43:45 -08:00
}
2016-05-19 19:15:19 +08:00
wake_up_all ( & ci - > i_cap_wq ) ;
2009-10-06 11:31:09 -07:00
return 0 ;
}
2018-12-10 16:35:09 +08:00
static void wake_up_session_caps ( struct ceph_mds_session * session , int ev )
2009-10-06 11:31:09 -07:00
{
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = session - > s_mdsc - > fsc - > client ;
doutc ( cl , " session %p mds%d \n " , session , session - > s_mds ) ;
2019-04-24 12:09:04 -04:00
ceph_iterate_session_caps ( session , wake_up_session_cb ,
( void * ) ( unsigned long ) ev ) ;
2009-10-06 11:31:09 -07:00
}
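wake_up_session_caps() passes its small integer event to the callback through the opaque void * argument by casting via unsigned long, and the callback casts it back. A minimal user-space sketch of that idiom follows; the names `enum wake_event`, `for_each_item()`, `record_event_cb()` and `wake_all()` are hypothetical and exist only for illustration.

```c
#include <assert.h>

/* Hypothetical event codes, mirroring the anonymous enum above. */
enum wake_event { EV_RECONNECT, EV_RENEWCAPS, EV_FORCE_RO };

/* Per-event counters so the effect of the callback is observable. */
static int seen[3];

/* Callback: recover the event code from the opaque pointer argument. */
static int record_event_cb(int item, void *arg)
{
	unsigned long ev = (unsigned long)arg;

	(void)item;
	seen[ev]++;
	return 0;
}

/* Generic iterator: stops early if the callback returns a negative value. */
static int for_each_item(const int *items, int n,
			 int (*cb)(int item, void *arg), void *arg)
{
	for (int i = 0; i < n; i++) {
		int ret = cb(items[i], arg);

		if (ret < 0)
			return ret;
	}
	return 0;
}

/* Pack the event into the pointer-sized argument, as the code above does. */
static int wake_all(const int *items, int n, enum wake_event ev)
{
	return for_each_item(items, n, record_event_cb,
			     (void *)(unsigned long)ev);
}
```

The round trip is safe here only because the value is a small non-negative integer that fits in a pointer; the pointer is never dereferenced.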
/*
* Send periodic message to MDS renewing all currently held caps . The
* ack will reset the expiration for all caps from this session .
*
* caller holds s_mutex
*/
static int send_renew_caps ( struct ceph_mds_client * mdsc ,
struct ceph_mds_session * session )
{
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = mdsc - > fsc - > client ;
2009-10-06 11:31:09 -07:00
struct ceph_msg * msg ;
int state ;
if ( time_after_eq ( jiffies , session - > s_cap_ttl ) & &
time_after_eq ( session - > s_cap_ttl , session - > s_renew_requested ) )
2023-06-12 09:04:07 +08:00
pr_info_client ( cl , " mds%d caps stale \n " , session - > s_mds ) ;
2010-03-18 13:43:09 -07:00
session - > s_renew_requested = jiffies ;
2009-10-06 11:31:09 -07:00
/* do not try to renew caps until a recovering mds has reconnected
* with its clients . */
state = ceph_mdsmap_get_state ( mdsc - > mdsmap , session - > s_mds ) ;
if ( state < CEPH_MDS_STATE_RECONNECT ) {
2023-06-12 09:04:07 +08:00
doutc ( cl , " ignoring mds%d (%s) \n " , session - > s_mds ,
ceph_mds_state_name ( state ) ) ;
2009-10-06 11:31:09 -07:00
return 0 ;
}
2023-06-12 09:04:07 +08:00
doutc ( cl , " to mds%d (%s) \n " , session - > s_mds ,
ceph_mds_state_name ( state ) ) ;
2021-07-05 09:22:54 +08:00
msg = ceph_create_session_msg ( CEPH_SESSION_REQUEST_RENEWCAPS ,
+ + session - > s_renew_seq ) ;
2010-04-01 16:06:19 -07:00
if ( ! msg )
return - ENOMEM ;
2009-10-06 11:31:09 -07:00
ceph_con_send ( & session - > s_con , msg ) ;
return 0 ;
}
2013-11-22 14:48:37 +08:00
static int send_flushmsg_ack ( struct ceph_mds_client * mdsc ,
struct ceph_mds_session * session , u64 seq )
{
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = mdsc - > fsc - > client ;
2013-11-22 14:48:37 +08:00
struct ceph_msg * msg ;
2023-06-12 09:04:07 +08:00
doutc ( cl , " to mds%d (%s) seq %llu \n " , session - > s_mds ,
ceph_session_state_name ( session - > s_state ) , seq ) ;
2021-07-05 09:22:54 +08:00
msg = ceph_create_session_msg ( CEPH_SESSION_FLUSHMSG_ACK , seq ) ;
2013-11-22 14:48:37 +08:00
if ( ! msg )
return - ENOMEM ;
ceph_con_send ( & session - > s_con , msg ) ;
return 0 ;
}
2009-10-06 11:31:09 -07:00
/*
* Note new cap ttl , and any transition from stale - > not stale ( fresh ? ) .
2009-11-20 13:43:45 -08:00
*
* Called under session - > s_mutex
2009-10-06 11:31:09 -07:00
*/
static void renewed_caps ( struct ceph_mds_client * mdsc ,
struct ceph_mds_session * session , int is_renew )
{
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = mdsc - > fsc - > client ;
2009-10-06 11:31:09 -07:00
int was_stale ;
int wake = 0 ;
spin_lock ( & session - > s_cap_lock ) ;
2012-01-12 17:48:11 -08:00
was_stale = is_renew & & time_after_eq ( jiffies , session - > s_cap_ttl ) ;
2009-10-06 11:31:09 -07:00
session - > s_cap_ttl = session - > s_renew_requested +
mdsc - > mdsmap - > m_session_timeout * HZ ;
if ( was_stale ) {
if ( time_before ( jiffies , session - > s_cap_ttl ) ) {
2023-06-12 09:04:07 +08:00
pr_info_client ( cl , " mds%d caps renewed \n " ,
session - > s_mds ) ;
2009-10-06 11:31:09 -07:00
wake = 1 ;
} else {
2023-06-12 09:04:07 +08:00
pr_info_client ( cl , " mds%d caps still stale \n " ,
session - > s_mds ) ;
2009-10-06 11:31:09 -07:00
}
}
2023-06-12 09:04:07 +08:00
doutc ( cl , " mds%d ttl now %lu, was %s, now %s \n " , session - > s_mds ,
session - > s_cap_ttl , was_stale ? " stale " : " fresh " ,
time_before ( jiffies , session - > s_cap_ttl ) ? " fresh " : " stale " ) ;
2009-10-06 11:31:09 -07:00
spin_unlock ( & session - > s_cap_lock ) ;
if ( wake )
2018-12-10 16:35:09 +08:00
wake_up_session_caps ( session , RENEWCAPS ) ;
2009-10-06 11:31:09 -07:00
}
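The stale-to-fresh check in renewed_caps() relies on wraparound-safe tick comparisons. The sketch below reproduces that arithmetic in user space under stated assumptions: `HZ` is fixed at 100, `tick_after_eq()` and `tick_before()` are hypothetical equivalents of the kernel's time_after_eq() and time_before() (comparison by signed difference, which assumes the usual two's-complement conversion), and `struct renew_state` and `note_renewal()` are illustrative names. The is_renew flag and the locking are omitted.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define HZ 100UL  /* assumed tick rate for this sketch */

/* Wraparound-safe comparisons: look at the sign of the difference. */
static bool tick_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

static bool tick_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Hypothetical subset of the session fields used by the TTL logic. */
struct renew_state {
	unsigned long cap_ttl;          /* tick at which the caps expire */
	unsigned long renew_requested;  /* tick at which renewal was requested */
};

/*
 * Record a renewal ack received at tick 'now'. Returns true when the
 * caps went from stale to fresh, i.e. when waiters should be woken.
 */
static bool note_renewal(struct renew_state *st, unsigned long now,
			 unsigned long timeout_secs)
{
	bool was_stale = tick_after_eq(now, st->cap_ttl);

	st->cap_ttl = st->renew_requested + timeout_secs * HZ;
	return was_stale && tick_before(now, st->cap_ttl);
}
```

Note that the new TTL is measured from the time the renewal was requested, not from the time the ack arrived, so a slow ack can leave the caps still stale.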
/*
* send a session close request
*/
2020-06-30 03:52:15 -04:00
static int request_close_session ( struct ceph_mds_session * session )
2009-10-06 11:31:09 -07:00
{
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = session - > s_mdsc - > fsc - > client ;
2009-10-06 11:31:09 -07:00
struct ceph_msg * msg ;
2023-06-12 09:04:07 +08:00
doutc ( cl , " mds%d state %s seq %lld \n " , session - > s_mds ,
ceph_session_state_name ( session - > s_state ) , session - > s_seq ) ;
2021-07-05 09:22:54 +08:00
msg = ceph_create_session_msg ( CEPH_SESSION_REQUEST_CLOSE ,
session - > s_seq ) ;
2010-04-01 16:06:19 -07:00
if ( ! msg )
return - ENOMEM ;
ceph_con_send ( & session - > s_con , msg ) ;
2016-09-14 16:39:51 +08:00
return 1 ;
2009-10-06 11:31:09 -07:00
}
/*
* Called with s_mutex held .
*/
static int __close_session ( struct ceph_mds_client * mdsc ,
struct ceph_mds_session * session )
{
if ( session - > s_state > = CEPH_MDS_SESSION_CLOSING )
return 0 ;
session - > s_state = CEPH_MDS_SESSION_CLOSING ;
2020-06-30 03:52:15 -04:00
return request_close_session ( session ) ;
2009-10-06 11:31:09 -07:00
}
2017-11-30 11:59:22 +08:00
static bool drop_negative_children ( struct dentry * dentry )
{
struct dentry * child ;
bool all_negative = true ;
if ( ! d_is_dir ( dentry ) )
goto out ;
spin_lock ( & dentry - > d_lock ) ;
2023-11-07 02:00:39 -05:00
hlist_for_each_entry ( child , & dentry - > d_children , d_sib ) {
2017-11-30 11:59:22 +08:00
if ( d_really_is_positive ( child ) ) {
all_negative = false ;
break ;
}
}
spin_unlock ( & dentry - > d_lock ) ;
if ( all_negative )
shrink_dcache_parent ( dentry ) ;
out :
return all_negative ;
}
2009-10-06 11:31:09 -07:00
/*
* Trim old ( er ) caps .
*
* Because we can ' t cache an inode without one or more caps , we do
* this indirectly : if a cap is unused , we prune its aliases , at which
* point the inode will hopefully get dropped too .
*
* Yes , this is a bit sloppy . Our only real goal here is to respond to
* memory pressure from the MDS , though , so it needn ' t be perfect .
*/
2023-04-19 10:39:14 +08:00
static int trim_caps_cb ( struct inode * inode , int mds , void * arg )
2009-10-06 11:31:09 -07:00
{
2023-06-09 15:15:47 +08:00
struct ceph_mds_client * mdsc = ceph_sb_to_mdsc ( inode - > i_sb ) ;
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = mdsc - > fsc - > client ;
2019-07-19 15:22:28 -04:00
int * remaining = arg ;
2009-10-06 11:31:09 -07:00
struct ceph_inode_info * ci = ceph_inode ( inode ) ;
2013-11-22 13:56:24 +08:00
int used , wanted , oissued , mine ;
2023-04-19 10:39:14 +08:00
struct ceph_cap * cap ;
2009-10-06 11:31:09 -07:00
2019-07-19 15:22:28 -04:00
if ( * remaining < = 0 )
2009-10-06 11:31:09 -07:00
return - 1 ;
2011-11-30 09:47:09 -08:00
spin_lock ( & ci - > i_ceph_lock ) ;
2023-04-19 10:39:14 +08:00
cap = __get_cap_for_mds ( ci , mds ) ;
if ( ! cap ) {
spin_unlock ( & ci - > i_ceph_lock ) ;
return 0 ;
}
mine = cap - > issued | cap - > implemented ;
used = __ceph_caps_used ( ci ) ;
wanted = __ceph_caps_file_wanted ( ci ) ;
oissued = __ceph_caps_issued_other ( ci , cap ) ;
doutc ( cl , " %p %llx.%llx cap %p mine %s oissued %s used %s wanted %s \n " ,
inode , ceph_vinop ( inode ) , cap , ceph_cap_string ( mine ) ,
ceph_cap_string ( oissued ) , ceph_cap_string ( used ) ,
ceph_cap_string ( wanted ) ) ;
if ( cap = = ci - > i_auth_cap ) {
if ( ci - > i_dirty_caps | | ci - > i_flushing_caps | |
! list_empty ( & ci - > i_cap_snaps ) )
goto out ;
if ( ( used | wanted ) & CEPH_CAP_ANY_WR )
goto out ;
/* Note: it's possible that i_filelock_ref becomes non-zero
* after dropping auth caps . It doesn ' t hurt because reply
* of lock mds request will re - add auth caps . */
if ( atomic_read ( & ci - > i_filelock_ref ) > 0 )
goto out ;
}
/* The inode has cached pages, but it's no longer used.
* we can safely drop it */
if ( S_ISREG ( inode - > i_mode ) & &
wanted = = 0 & & used = = CEPH_CAP_FILE_CACHE & &
! ( oissued & CEPH_CAP_FILE_CACHE ) ) {
used = 0 ;
oissued = 0 ;
}
if ( ( used | wanted ) & ~ oissued & mine )
goto out ; /* we need these caps */
if ( oissued ) {
/* we aren't the only cap.. just remove us */
ceph_remove_cap ( mdsc , cap , true ) ;
( * remaining ) - - ;
} else {
struct dentry * dentry ;
/* try dropping referring dentries */
spin_unlock ( & ci - > i_ceph_lock ) ;
dentry = d_find_any_alias ( inode ) ;
if ( dentry & & drop_negative_children ( dentry ) ) {
int count ;
dput ( dentry ) ;
d_prune_aliases ( inode ) ;
count = atomic_read ( & inode - > i_count ) ;
if ( count = = 1 )
( * remaining ) - - ;
doutc ( cl , " %p %llx.%llx cap %p pruned, count now %d \n " ,
inode , ceph_vinop ( inode ) , cap , count ) ;
} else {
dput ( dentry ) ;
}
return 0 ;
}
out :
spin_unlock ( & ci - > i_ceph_lock ) ;
return 0 ;
}
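The decision in trim_caps_cb() about whether a cap is still needed comes down to one mask expression. A minimal userspace sketch of that test follows; the DEMO_* bit values and the helper name are hypothetical and chosen only for illustration, they are not the real CEPH_CAP_* definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical capability bits, for illustration only. */
enum {
	DEMO_CAP_SHARED = 1 << 0,
	DEMO_CAP_CACHE  = 1 << 1,
	DEMO_CAP_WR     = 1 << 2,
};

/*
 * A cap must be kept when some bit that is in use or wanted is
 * granted by this cap ("mine") and by no other cap ("oissued").
 */
static bool demo_cap_still_needed(int used, int wanted, int oissued, int mine)
{
	return ((used | wanted) & ~oissued & mine) != 0;
}
```

When another MDS also issues the bit in question, the expression evaluates to zero and the cap becomes a candidate for removal.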
/*
* Trim session cap count down to some max number .
*/
int ceph_trim_caps ( struct ceph_mds_client * mdsc ,
struct ceph_mds_session * session ,
int max_caps )
{
struct ceph_client * cl = mdsc - > fsc - > client ;
int trim_caps = session - > s_nr_caps - max_caps ;
doutc ( cl , " mds%d start: %d / %d, trim %d \n " , session - > s_mds ,
session - > s_nr_caps , max_caps , trim_caps ) ;
if ( trim_caps > 0 ) {
int remaining = trim_caps ;
ceph_iterate_session_caps ( session , trim_caps_cb , & remaining ) ;
doutc ( cl , " mds%d done: %d / %d, trimmed %d \n " ,
session - > s_mds , session - > s_nr_caps , max_caps ,
trim_caps - remaining ) ;
}
ceph_flush_cap_releases ( mdsc , session ) ;
return 0 ;
}
static int check_caps_flush ( struct ceph_mds_client * mdsc ,
u64 want_flush_tid )
{
struct ceph_client * cl = mdsc - > fsc - > client ;
int ret = 1 ;
spin_lock ( & mdsc - > cap_dirty_lock ) ;
if ( ! list_empty ( & mdsc - > cap_flush_list ) ) {
struct ceph_cap_flush * cf =
list_first_entry ( & mdsc - > cap_flush_list ,
struct ceph_cap_flush , g_list ) ;
if ( cf - > tid < = want_flush_tid ) {
doutc ( cl , " still flushing tid %llu <= %llu \n " ,
cf - > tid , want_flush_tid ) ;
ret = 0 ;
}
}
spin_unlock ( & mdsc - > cap_dirty_lock ) ;
return ret ;
}
/*
* wait for outstanding cap flushes .
*
* returns once every cap flush with a tid up to and including
* want_flush_tid has completed .
*/
static void wait_caps_flush ( struct ceph_mds_client * mdsc ,
u64 want_flush_tid )
{
struct ceph_client * cl = mdsc - > fsc - > client ;
doutc ( cl , " want %llu \n " , want_flush_tid ) ;
wait_event ( mdsc - > cap_flushing_wq ,
check_caps_flush ( mdsc , want_flush_tid ) ) ;
doutc ( cl , " ok, flushed thru %llu \n " , want_flush_tid ) ;
}
/*
* called under s_mutex
*/
static void ceph_send_cap_releases ( struct ceph_mds_client * mdsc ,
struct ceph_mds_session * session )
{
struct ceph_client * cl = mdsc - > fsc - > client ;
struct ceph_msg * msg = NULL ;
struct ceph_mds_cap_release * head ;
struct ceph_mds_cap_item * item ;
struct ceph_osd_client * osdc = & mdsc - > fsc - > client - > osdc ;
struct ceph_cap * cap ;
LIST_HEAD ( tmp_list ) ;
int num_cap_releases ;
__le32 barrier , * cap_barrier ;
down_read ( & osdc - > lock ) ;
barrier = cpu_to_le32 ( osdc - > epoch_barrier ) ;
up_read ( & osdc - > lock ) ;
spin_lock ( & session - > s_cap_lock ) ;
again :
list_splice_init ( & session - > s_cap_releases , & tmp_list ) ;
num_cap_releases = session - > s_num_cap_releases ;
session - > s_num_cap_releases = 0 ;
spin_unlock ( & session - > s_cap_lock ) ;
while ( ! list_empty ( & tmp_list ) ) {
if ( ! msg ) {
msg = ceph_msg_new ( CEPH_MSG_CLIENT_CAPRELEASE ,
PAGE_SIZE , GFP_NOFS , false ) ;
if ( ! msg )
goto out_err ;
head = msg - > front . iov_base ;
head - > num = cpu_to_le32 ( 0 ) ;
msg - > front . iov_len = sizeof ( * head ) ;
msg - > hdr . version = cpu_to_le16 ( 2 ) ;
msg - > hdr . compat_version = cpu_to_le16 ( 1 ) ;
}
cap = list_first_entry ( & tmp_list , struct ceph_cap ,
session_caps ) ;
list_del ( & cap - > session_caps ) ;
num_cap_releases - - ;
head = msg - > front . iov_base ;
put_unaligned_le32 ( get_unaligned_le32 ( & head - > num ) + 1 ,
& head - > num ) ;
item = msg - > front . iov_base + msg - > front . iov_len ;
item - > ino = cpu_to_le64 ( cap - > cap_ino ) ;
item - > cap_id = cpu_to_le64 ( cap - > cap_id ) ;
item - > migrate_seq = cpu_to_le32 ( cap - > mseq ) ;
item - > seq = cpu_to_le32 ( cap - > issue_seq ) ;
msg - > front . iov_len + = sizeof ( * item ) ;
ceph_put_cap ( mdsc , cap ) ;
if ( le32_to_cpu ( head - > num ) = = CEPH_CAPS_PER_RELEASE ) {
// Append cap_barrier field
cap_barrier = msg - > front . iov_base + msg - > front . iov_len ;
* cap_barrier = barrier ;
msg - > front . iov_len + = sizeof ( * cap_barrier ) ;
msg - > hdr . front_len = cpu_to_le32 ( msg - > front . iov_len ) ;
doutc ( cl , " mds%d %p \n " , session - > s_mds , msg ) ;
ceph_con_send ( & session - > s_con , msg ) ;
msg = NULL ;
}
}
BUG_ON ( num_cap_releases ! = 0 ) ;
spin_lock ( & session - > s_cap_lock ) ;
if ( ! list_empty ( & session - > s_cap_releases ) )
goto again ;
spin_unlock ( & session - > s_cap_lock ) ;
if ( msg ) {
// Append cap_barrier field
cap_barrier = msg - > front . iov_base + msg - > front . iov_len ;
* cap_barrier = barrier ;
msg - > front . iov_len + = sizeof ( * cap_barrier ) ;
msg - > hdr . front_len = cpu_to_le32 ( msg - > front . iov_len ) ;
doutc ( cl , " mds%d %p \n " , session - > s_mds , msg ) ;
ceph_con_send ( & session - > s_con , msg ) ;
}
return ;
out_err :
pr_err_client ( cl , " mds%d, failed to allocate message \n " ,
session - > s_mds ) ;
spin_lock ( & session - > s_cap_lock ) ;
list_splice ( & tmp_list , & session - > s_cap_releases ) ;
session - > s_num_cap_releases + = num_cap_releases ;
spin_unlock ( & session - > s_cap_lock ) ;
}
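The batching loop in ceph_send_cap_releases() fills a message until it holds CEPH_CAPS_PER_RELEASE items, sends it, and starts a new one, with a final partial message sent after the loop. The sketch below models only that control flow in userspace. DEMO_ITEMS_PER_MSG, the struct, and the function name are hypothetical, and the batch size of 4 is an arbitrary stand-in for the real constant.

```c
#include <assert.h>

#define DEMO_ITEMS_PER_MSG 4	/* assumed batch size, illustration only */

struct demo_batch_stats {
	int messages;		/* number of messages "sent" */
	int last_msg_items;	/* item count of the last message sent */
};

/* Model of the fill / send-when-full / send-remainder pattern. */
static struct demo_batch_stats demo_send_releases(int nitems)
{
	struct demo_batch_stats st = { 0, 0 };
	int have_msg = 0;
	int in_msg = 0;

	for (int i = 0; i < nitems; i++) {
		if (!have_msg) {	/* lazily "allocate" a message */
			have_msg = 1;
			in_msg = 0;
		}
		in_msg++;
		if (in_msg == DEMO_ITEMS_PER_MSG) {	/* full: send it */
			st.messages++;
			st.last_msg_items = in_msg;
			have_msg = 0;
		}
	}
	if (have_msg) {		/* send the partially filled tail */
		st.messages++;
		st.last_msg_items = in_msg;
	}
	return st;
}
```

In the kernel code each sent message additionally gets the cap_barrier field appended just before ceph_con_send(); the sketch leaves that out.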
static void ceph_cap_release_work ( struct work_struct * work )
{
struct ceph_mds_session * session =
container_of ( work , struct ceph_mds_session , s_cap_release_work ) ;
mutex_lock ( & session - > s_mutex ) ;
if ( session - > s_state = = CEPH_MDS_SESSION_OPEN | |
session - > s_state = = CEPH_MDS_SESSION_HUNG )
ceph_send_cap_releases ( session - > s_mdsc , session ) ;
mutex_unlock ( & session - > s_mutex ) ;
ceph_put_mds_session ( session ) ;
}
void ceph_flush_cap_releases ( struct ceph_mds_client * mdsc ,
struct ceph_mds_session * session )
{
struct ceph_client * cl = mdsc - > fsc - > client ;
if ( mdsc - > stopping )
return ;
ceph_get_mds_session ( session ) ;
if ( queue_work ( mdsc - > fsc - > cap_wq ,
& session - > s_cap_release_work ) ) {
doutc ( cl , " cap release work queued \n " ) ;
} else {
ceph_put_mds_session ( session ) ;
doutc ( cl , " failed to queue cap release work \n " ) ;
}
}
/*
* caller holds session - > s_cap_lock
*/
void __ceph_queue_cap_release ( struct ceph_mds_session * session ,
struct ceph_cap * cap )
{
list_add_tail ( & cap - > session_caps , & session - > s_cap_releases ) ;
session - > s_num_cap_releases + + ;
if ( ! ( session - > s_num_cap_releases % CEPH_CAPS_PER_RELEASE ) )
ceph_flush_cap_releases ( session - > s_mdsc , session ) ;
}
static void ceph_cap_reclaim_work ( struct work_struct * work )
{
struct ceph_mds_client * mdsc =
container_of ( work , struct ceph_mds_client , cap_reclaim_work ) ;
int ret = ceph_trim_dentries ( mdsc ) ;
if ( ret = = - EAGAIN )
ceph_queue_cap_reclaim_work ( mdsc ) ;
}
void ceph_queue_cap_reclaim_work ( struct ceph_mds_client * mdsc )
{
struct ceph_client * cl = mdsc - > fsc - > client ;
if ( mdsc - > stopping )
return ;
if ( queue_work ( mdsc - > fsc - > cap_wq , & mdsc - > cap_reclaim_work ) ) {
doutc ( cl , " caps reclaim work queued \n " ) ;
} else {
doutc ( cl , " failed to queue caps reclaim work \n " ) ;
}
}
void ceph_reclaim_caps_nr ( struct ceph_mds_client * mdsc , int nr )
{
int val ;
if ( ! nr )
return ;
val = atomic_add_return ( nr , & mdsc - > cap_reclaim_pending ) ;
if ( ( val % CEPH_CAPS_PER_RELEASE ) < nr ) {
atomic_set ( & mdsc - > cap_reclaim_pending , 0 ) ;
ceph_queue_cap_reclaim_work ( mdsc ) ;
}
}
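The check `(val % CEPH_CAPS_PER_RELEASE) < nr` in ceph_reclaim_caps_nr() fires whenever adding nr pushes the pending counter across a multiple of the batch size. A rough userspace sketch of that behaviour, using a plain int instead of an atomic_t: DEMO_CAPS_PER_RELEASE (set to 64 here) and the helper names are assumptions made for the example, not the kernel's values.

```c
#include <assert.h>

#define DEMO_CAPS_PER_RELEASE 64	/* assumed batch size, illustration only */

struct demo_reclaim {
	int pending;	/* stands in for mdsc->cap_reclaim_pending */
	int queued;	/* how many times reclaim work was "queued" */
};

static void demo_reclaim_caps_nr(struct demo_reclaim *r, int nr)
{
	int val;

	if (!nr)
		return;
	val = (r->pending += nr);
	/* True when this addition crossed a multiple of the batch size. */
	if ((val % DEMO_CAPS_PER_RELEASE) < nr) {
		r->pending = 0;
		r->queued++;
	}
}

/* Convenience wrapper: apply two additions and report the queue count. */
static int demo_queued_after_two(int a, int b)
{
	struct demo_reclaim r = { 0, 0 };

	demo_reclaim_caps_nr(&r, a);
	demo_reclaim_caps_nr(&r, b);
	return r.queued;
}
```

With these values, adding 10 and then 60 crosses 64 on the second call and queues work once, while adding 10 and then 20 never crosses the threshold.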
/*
* requests
*/
int ceph_alloc_readdir_reply_buffer ( struct ceph_mds_request * req ,
struct inode * dir )
{
struct ceph_inode_info * ci = ceph_inode ( dir ) ;
struct ceph_mds_reply_info_parsed * rinfo = & req - > r_reply_info ;
struct ceph_mount_options * opt = req - > r_mdsc - > fsc - > mount_options ;
size_t size = sizeof ( struct ceph_mds_reply_dir_entry ) ;
unsigned int num_entries ;
int order ;
spin_lock ( & ci - > i_ceph_lock ) ;
num_entries = ci - > i_files + ci - > i_subdirs ;
spin_unlock ( & ci - > i_ceph_lock ) ;
num_entries = max ( num_entries , 1U ) ;
num_entries = min ( num_entries , opt - > max_readdir ) ;
order = get_order ( size * num_entries ) ;
while ( order > = 0 ) {
rinfo - > dir_entries = ( void * ) __get_free_pages ( GFP_KERNEL |
__GFP_NOWARN |
__GFP_ZERO ,
order ) ;
if ( rinfo - > dir_entries )
break ;
order - - ;
}
if ( ! rinfo - > dir_entries )
return - ENOMEM ;
num_entries = ( PAGE_SIZE < < order ) / size ;
num_entries = min ( num_entries , opt - > max_readdir ) ;
rinfo - > dir_buf_size = PAGE_SIZE < < order ;
req - > r_num_caps = num_entries + 1 ;
req - > r_args . readdir . max_entries = cpu_to_le32 ( num_entries ) ;
req - > r_args . readdir . max_bytes = cpu_to_le32 ( opt - > max_readdir_bytes ) ;
return 0 ;
}
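ceph_alloc_readdir_reply_buffer() sizes its buffer by page order and falls back to smaller orders when allocation fails, then recomputes how many entries fit. The following userspace sketch mirrors that arithmetic only. The 4096-byte page size, the max_avail_order parameter (which simulates the largest order that can be allocated), and all demo_* names are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096u	/* assumed page size */

/* Smallest order such that (DEMO_PAGE_SIZE << order) >= size. */
static int demo_get_order(size_t size)
{
	int order = 0;
	size_t span = DEMO_PAGE_SIZE;

	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}

/*
 * Returns how many entries the buffer ends up holding, or 0 if even a
 * single page cannot be allocated. Allocation of any order above
 * max_avail_order is treated as a failure.
 */
static unsigned int demo_readdir_entries(size_t entry_size, unsigned int want,
					 unsigned int max_readdir,
					 int max_avail_order)
{
	unsigned int n = want ? want : 1;
	int order;

	if (n > max_readdir)
		n = max_readdir;
	order = demo_get_order(entry_size * n);
	while (order >= 0 && order > max_avail_order)
		order--;	/* simulated allocation failure, retry smaller */
	if (order < 0)
		return 0;
	n = (unsigned int)((DEMO_PAGE_SIZE << order) / entry_size);
	if (n > max_readdir)
		n = max_readdir;
	return n;
}
```

Because the buffer is rounded up to a power-of-two number of pages, the final entry count can exceed the directory's current size, and it is capped by max_readdir in both places.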
/*
* Create an mds request .
*/
struct ceph_mds_request *
ceph_mdsc_create_request ( struct ceph_mds_client * mdsc , int op , int mode )
{
struct ceph_mds_request * req ;
req = kmem_cache_zalloc ( ceph_mds_request_cachep , GFP_NOFS ) ;
if ( ! req )
return ERR_PTR ( - ENOMEM ) ;
mutex_init ( & req - > r_fill_mutex ) ;
req - > r_mdsc = mdsc ;
req - > r_started = jiffies ;
req - > r_start_latency = ktime_get ( ) ;
req - > r_resend_mds = - 1 ;
INIT_LIST_HEAD ( & req - > r_unsafe_dir_item ) ;
INIT_LIST_HEAD ( & req - > r_unsafe_target_item ) ;
req - > r_fmode = - 1 ;
req - > r_feature_needed = - 1 ;
kref_init ( & req - > r_kref ) ;
RB_CLEAR_NODE ( & req - > r_node ) ;
INIT_LIST_HEAD ( & req - > r_wait ) ;
init_completion ( & req - > r_completion ) ;
init_completion ( & req - > r_safe_completion ) ;
INIT_LIST_HEAD ( & req - > r_unsafe_item ) ;
ktime_get_coarse_real_ts64 ( & req - > r_stamp ) ;
req - > r_op = op ;
req - > r_direct_mode = mode ;
return req ;
}
/*
* return the oldest ( lowest tid ) request in the request tree , NULL if none .
*
* called under mdsc - > mutex .
*/
static struct ceph_mds_request * __get_oldest_req ( struct ceph_mds_client * mdsc )
{
if ( RB_EMPTY_ROOT ( & mdsc - > request_tree ) )
return NULL ;
return rb_entry ( rb_first ( & mdsc - > request_tree ) ,
struct ceph_mds_request , r_node ) ;
}
static inline u64 __get_oldest_tid ( struct ceph_mds_client * mdsc )
{
return mdsc - > oldest_tid ;
}
# if IS_ENABLED(CONFIG_FS_ENCRYPTION)
static u8 * get_fscrypt_altname ( const struct ceph_mds_request * req , u32 * plen )
{
struct inode * dir = req - > r_parent ;
struct dentry * dentry = req - > r_dentry ;
u8 * cryptbuf = NULL ;
u32 len = 0 ;
int ret = 0 ;
/* only encode if we have parent and dentry */
if ( ! dir | | ! dentry )
goto success ;
/* No-op unless this is encrypted */
if ( ! IS_ENCRYPTED ( dir ) )
goto success ;
ret = ceph_fscrypt_prepare_readdir ( dir ) ;
if ( ret < 0 )
return ERR_PTR ( ret ) ;
/* No key? Just ignore it. */
if ( ! fscrypt_has_encryption_key ( dir ) )
goto success ;
if ( ! fscrypt_fname_encrypted_size ( dir , dentry - > d_name . len , NAME_MAX ,
& len ) ) {
WARN_ON_ONCE ( 1 ) ;
return ERR_PTR ( - ENAMETOOLONG ) ;
}
/* No need to append altname if name is short enough */
if ( len < = CEPH_NOHASH_NAME_MAX ) {
len = 0 ;
goto success ;
}
cryptbuf = kmalloc ( len , GFP_KERNEL ) ;
if ( ! cryptbuf )
return ERR_PTR ( - ENOMEM ) ;
ret = fscrypt_fname_encrypt ( dir , & dentry - > d_name , cryptbuf , len ) ;
if ( ret ) {
kfree ( cryptbuf ) ;
return ERR_PTR ( ret ) ;
}
success :
* plen = len ;
return cryptbuf ;
}
# else
static u8 * get_fscrypt_altname ( const struct ceph_mds_request * req , u32 * plen )
{
* plen = 0 ;
return NULL ;
}
# endif
/**
* ceph_mdsc_build_path - build a path string to a given dentry
* @ mdsc : mds client
* @ dentry : dentry to which path should be built
* @ plen : returned length of string
* @ pbase : returned base inode number
* @ for_wire : is this path going to be sent to the MDS ?
*
* Build a string that represents the path to the dentry . This is mostly called
* for two different purposes :
*
* 1 ) we need to build a path string to send to the MDS ( for_wire = = true )
* 2 ) we need a path string for local presentation ( e . g . debugfs )
* ( for_wire = = false )
*
* The path is built in reverse , starting with the dentry . Walk back up toward
* the root , building the path until the first non - snapped inode is reached
* ( for_wire ) or the root inode is reached ( ! for_wire ) .
*
* Encode hidden .snap dirs as a double /, i.e.
* foo/.snap/bar -> foo//bar
*/
char * ceph_mdsc_build_path ( struct ceph_mds_client * mdsc , struct dentry * dentry ,
int * plen , u64 * pbase , int for_wire )
{
struct ceph_client * cl = mdsc - > fsc - > client ;
struct dentry * cur ;
struct inode * inode ;
char * path ;
int pos ;
unsigned seq ;
u64 base ;
if ( ! dentry )
return ERR_PTR ( - EINVAL ) ;
path = __getname ( ) ;
if ( ! path )
return ERR_PTR ( - ENOMEM ) ;
retry :
pos = PATH_MAX - 1 ;
path [ pos ] = ' \0 ' ;
seq = read_seqbegin ( & rename_lock ) ;
cur = dget ( dentry ) ;
for ( ; ; ) {
struct dentry * parent ;
spin_lock ( & cur - > d_lock ) ;
inode = d_inode ( cur ) ;
if ( inode & & ceph_snap ( inode ) = = CEPH_SNAPDIR ) {
doutc ( cl , " path+%d: %p SNAPDIR \n " , pos , cur ) ;
spin_unlock ( & cur - > d_lock ) ;
parent = dget_parent ( cur ) ;
} else if ( for_wire & & inode & & dentry ! = cur & &
ceph_snap ( inode ) = = CEPH_NOSNAP ) {
spin_unlock ( & cur - > d_lock ) ;
pos + + ; /* get rid of any prepended '/' */
break ;
} else if ( ! for_wire | | ! IS_ENCRYPTED ( d_inode ( cur - > d_parent ) ) ) {
pos - = cur - > d_name . len ;
if ( pos < 0 ) {
spin_unlock ( & cur - > d_lock ) ;
break ;
}
memcpy ( path + pos , cur - > d_name . name , cur - > d_name . len ) ;
spin_unlock ( & cur - > d_lock ) ;
parent = dget_parent ( cur ) ;
} else {
int len , ret ;
char buf [ NAME_MAX ] ;
/*
* Proactively copy name into buf , in case we need to
* present it as - is .
*/
memcpy ( buf , cur - > d_name . name , cur - > d_name . len ) ;
len = cur - > d_name . len ;
spin_unlock ( & cur - > d_lock ) ;
parent = dget_parent ( cur ) ;
ret = ceph_fscrypt_prepare_readdir ( d_inode ( parent ) ) ;
if ( ret < 0 ) {
dput ( parent ) ;
dput ( cur ) ;
return ERR_PTR ( ret ) ;
}
if ( fscrypt_has_encryption_key ( d_inode ( parent ) ) ) {
len = ceph_encode_encrypted_fname ( d_inode ( parent ) ,
cur , buf ) ;
if ( len < 0 ) {
dput ( parent ) ;
dput ( cur ) ;
return ERR_PTR ( len ) ;
}
}
pos - = len ;
if ( pos < 0 ) {
dput ( parent ) ;
break ;
}
memcpy ( path + pos , buf , len ) ;
}
dput ( cur ) ;
cur = parent ;
/* Are we at the root? */
if ( IS_ROOT ( cur ) )
break ;
/* Are we out of buffer? */
if ( - - pos < 0 )
break ;
path [ pos ] = ' / ' ;
}
inode = d_inode ( cur ) ;
base = inode ? ceph_ino ( inode ) : 0 ;
dput ( cur ) ;
if ( read_seqretry ( & rename_lock , seq ) )
goto retry ;
if ( pos < 0 ) {
/*
* A rename didn ' t occur , but somehow we didn ' t end up where
* we thought we would . Throw a warning and try again .
*/
pr_warn_client ( cl , " did not end path lookup where expected (pos = %d) \n " ,
pos ) ;
goto retry ;
}
* pbase = base ;
* plen = PATH_MAX - 1 - pos ;
doutc ( cl , " on %p %d built %llx '%.*s' \n " , dentry , d_count ( dentry ) ,
base , * plen , path + pos ) ;
return path + pos ;
}
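ceph_mdsc_build_path() fills its buffer from the end toward the start, prepending one name per step and a '/' between names. A simplified userspace sketch of that technique follows. It takes the components as a leaf-first array rather than walking dentries, ignores locking, rename retries and encryption, and uses hypothetical demo_* names throughout. An empty component models the hidden .snap directory, which produces the double slash described in the comment above.

```c
#include <assert.h>
#include <string.h>

static char demo_buf[16];	/* small stand-in for a PATH_MAX buffer */
static int demo_len;

/* Components as a walk from the leaf up would visit them. */
static const char *const demo_leaf_first[] = { "bar", "", "foo" };
static const char *const demo_too_long[] = { "0123456789abcdefgh" };

/*
 * Build the path right to left. Returns a pointer into buf where the
 * string starts, or NULL if the components do not fit.
 */
static char *demo_build_path(char *buf, int buflen,
			     const char *const *leaf_first, int n, int *plen)
{
	int pos = buflen - 1;

	buf[pos] = '\0';
	for (int i = 0; i < n; i++) {
		int len = (int)strlen(leaf_first[i]);

		pos -= len;
		if (pos < 0)
			return NULL;	/* out of buffer */
		memcpy(buf + pos, leaf_first[i], (size_t)len);
		if (i == n - 1)
			break;		/* reached the top component */
		if (--pos < 0)
			return NULL;
		buf[pos] = '/';
	}
	*plen = buflen - 1 - pos;
	return buf + pos;
}
```

Building backwards means the final length does not need to be known in advance, and the returned pointer is an offset into the buffer rather than its start, which is why a caller would need to keep track of the original allocation separately.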
static int build_dentry_path ( struct ceph_mds_client * mdsc , struct dentry * dentry ,
struct inode * dir , const char * * ppath , int * ppathlen ,
u64 * pino , bool * pfreepath , bool parent_locked )
{
char * path ;
rcu_read_lock ( ) ;
if ( ! dir )
dir = d_inode_rcu ( dentry - > d_parent ) ;
if ( dir & & parent_locked & & ceph_snap ( dir ) = = CEPH_NOSNAP & &
! IS_ENCRYPTED ( dir ) ) {
* pino = ceph_ino ( dir ) ;
rcu_read_unlock ( ) ;
* ppath = dentry - > d_name . name ;
* ppathlen = dentry - > d_name . len ;
2009-10-06 11:31:09 -07:00
return 0 ;
}
rcu_read_unlock ( ) ;
path = ceph_mdsc_build_path ( mdsc , dentry , ppathlen , pino , 1 ) ;
if ( IS_ERR ( path ) )
return PTR_ERR ( path ) ;
* ppath = path ;
* pfreepath = true ;
return 0 ;
}
static int build_inode_path ( struct inode * inode ,
const char * * ppath , int * ppathlen , u64 * pino ,
bool * pfreepath )
{
struct ceph_mds_client * mdsc = ceph_sb_to_mdsc ( inode - > i_sb ) ;
struct dentry * dentry ;
char * path ;
if ( ceph_snap ( inode ) = = CEPH_NOSNAP ) {
* pino = ceph_ino ( inode ) ;
* ppathlen = 0 ;
return 0 ;
}
dentry = d_find_alias ( inode ) ;
path = ceph_mdsc_build_path ( mdsc , dentry , ppathlen , pino , 1 ) ;
dput ( dentry ) ;
if ( IS_ERR ( path ) )
return PTR_ERR ( path ) ;
* ppath = path ;
* pfreepath = true ;
return 0 ;
}
/*
* request arguments may be specified via an inode * , a dentry * , or
* an explicit ino + path .
*/
static int set_request_path_attr ( struct ceph_mds_client * mdsc , struct inode * rinode ,
struct dentry * rdentry , struct inode * rdiri ,
const char * rpath , u64 rino , const char * * ppath ,
int * pathlen , u64 * ino , bool * freepath ,
bool parent_locked )
{
struct ceph_client * cl = mdsc - > fsc - > client ;
int r = 0 ;
if ( rinode ) {
r = build_inode_path ( rinode , ppath , pathlen , ino , freepath ) ;
doutc ( cl , " inode %p %llx.%llx \n " , rinode , ceph_ino ( rinode ) ,
ceph_snap ( rinode ) ) ;
} else if ( rdentry ) {
r = build_dentry_path ( mdsc , rdentry , rdiri , ppath , pathlen , ino ,
freepath , parent_locked ) ;
doutc ( cl , " dentry %p %llx/%.*s \n " , rdentry , * ino , * pathlen , * ppath ) ;
} else if ( rpath | | rino ) {
* ino = rino ;
* ppath = rpath ;
* pathlen = rpath ? strlen ( rpath ) : 0 ;
doutc ( cl , " path %.*s \n " , * pathlen , rpath ) ;
}
return r ;
}
static void encode_mclientrequest_tail ( void * * p ,
const struct ceph_mds_request * req )
{
struct ceph_timespec ts ;
int i ;
ceph_encode_timespec64 ( & ts , & req - > r_stamp ) ;
ceph_encode_copy ( p , & ts , sizeof ( ts ) ) ;
/* v4: gid_list */
ceph_encode_32 ( p , req - > r_cred - > group_info - > ngroups ) ;
for ( i = 0 ; i < req - > r_cred - > group_info - > ngroups ; i + + )
ceph_encode_64 ( p , from_kgid ( & init_user_ns ,
req - > r_cred - > group_info - > gid [ i ] ) ) ;
/* v5: altname */
ceph_encode_32 ( p , req - > r_altname_len ) ;
ceph_encode_copy ( p , req - > r_altname , req - > r_altname_len ) ;
/* v6: fscrypt_auth and fscrypt_file */
if ( req - > r_fscrypt_auth ) {
u32 authlen = ceph_fscrypt_auth_len ( req - > r_fscrypt_auth ) ;
ceph_encode_32 ( p , authlen ) ;
ceph_encode_copy ( p , req - > r_fscrypt_auth , authlen ) ;
} else {
ceph_encode_32 ( p , 0 ) ;
}
if ( test_bit ( CEPH_MDS_R_FSCRYPT_FILE , & req - > r_req_flags ) ) {
ceph_encode_32 ( p , sizeof ( __le64 ) ) ;
ceph_encode_64 ( p , req - > r_fscrypt_file ) ;
} else {
ceph_encode_32 ( p , 0 ) ;
}
}
static inline u16 mds_supported_head_version ( struct ceph_mds_session * session )
{
if ( ! test_bit ( CEPHFS_FEATURE_32BITS_RETRY_FWD , & session - > s_features ) )
return 1 ;
if ( ! test_bit ( CEPHFS_FEATURE_HAS_OWNER_UIDGID , & session - > s_features ) )
return 2 ;
return CEPH_MDS_REQUEST_HEAD_VERSION ;
}
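mds_supported_head_version() maps the session's negotiated feature bits to the newest request head version the MDS can parse, checking the oldest feature first. A small standalone sketch of that cascade is shown here. The DEMO_* flag values are hypothetical, a plain bitmask replaces test_bit() on session->s_features, and the top version of 3 is an assumption standing in for CEPH_MDS_REQUEST_HEAD_VERSION.

```c
#include <assert.h>

/* Hypothetical feature flags, illustration only. */
enum {
	DEMO_FEATURE_32BITS_RETRY_FWD = 1u << 0,
	DEMO_FEATURE_HAS_OWNER_UIDGID = 1u << 1,
};

#define DEMO_REQUEST_HEAD_VERSION 3	/* assumed newest head version */

/* Pick the newest head version the peer's feature set allows. */
static unsigned short demo_head_version(unsigned int features)
{
	if (!(features & DEMO_FEATURE_32BITS_RETRY_FWD))
		return 1;
	if (!(features & DEMO_FEATURE_HAS_OWNER_UIDGID))
		return 2;
	return DEMO_REQUEST_HEAD_VERSION;
}
```

Ordering the checks from oldest to newest feature means a peer missing an early feature is held at the lowest version even if it advertises a later one.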
static struct ceph_mds_request_head_legacy *
find_legacy_request_head ( void * p , u64 features )
{
bool legacy = ! ( features & CEPH_FEATURE_FS_BTIME ) ;
struct ceph_mds_request_head_old * ohead ;
if ( legacy )
return ( struct ceph_mds_request_head_legacy * ) p ;
ohead = ( struct ceph_mds_request_head_old * ) p ;
return ( struct ceph_mds_request_head_legacy * ) & ohead - > oldest_client_tid ;
}
/*
* called under mdsc - > mutex
*/
static struct ceph_msg * create_request_message ( struct ceph_mds_session * session ,
struct ceph_mds_request * req ,
2020-12-09 10:12:59 -05:00
bool drop_cap_releases )
2009-10-06 11:31:09 -07:00
{
2020-12-09 10:12:59 -05:00
int mds = session - > s_mds ;
struct ceph_mds_client * mdsc = session - > s_mdsc ;
ceph: handle idmapped mounts in create_request_message()
Inode operations that create a new filesystem object such as ->mknod,
->create, ->mkdir() and others don't take a {g,u}id argument explicitly.
Instead the caller's fs{g,u}id is used for the {g,u}id of the new
filesystem object.
In order to ensure that the correct {g,u}id is used map the caller's
fs{g,u}id for creation requests. This doesn't require complex changes.
It suffices to pass in the relevant idmapping recorded in the request
message. If this request message was triggered from an inode operation
that creates filesystem objects it will have passed down the relevant
idmapping. If this is a request message that was triggered from an inode
operation that doesn't need to take idmappings into account the initial
idmapping is passed down which is an identity mapping.
This change uses a new cephfs protocol extension CEPHFS_FEATURE_HAS_OWNER_UIDGID
which adds two new fields (owner_{u,g}id) to the request head structure.
So, we need to ensure that MDS supports it otherwise we need to fail
any IO that comes through an idmapped mount because we can't process it
in a proper way. MDS server without such an extension will use caller_{u,g}id
fields to set a new inode owner UID/GID which is incorrect because caller_{u,g}id
values are unmapped. At the same time we can't map these fields with an
idmapping as it can break UID/GID-based permission checks logic on the
MDS side. This problem was described with a lot of details at [1], [2].
[1] https://lore.kernel.org/lkml/CAEivzxfw1fHO2TFA4dx3u23ZKK6Q+EThfzuibrhA3RKM=ZOYLg@mail.gmail.com/
[2] https://lore.kernel.org/all/20220104140414.155198-3-brauner@kernel.org/
Link: https://github.com/ceph/ceph/pull/52575
Link: https://tracker.ceph.com/issues/62217
Co-Developed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-07 15:26:17 +02:00
	struct ceph_client *cl = mdsc->fsc->client;
2009-10-06 11:31:09 -07:00
	struct ceph_msg *msg;
2023-07-25 17:51:59 +08:00
	struct ceph_mds_request_head_legacy *lhead;
2009-10-06 11:31:09 -07:00
	const char *path1 = NULL;
	const char *path2 = NULL;
	u64 ino1 = 0, ino2 = 0;
	int pathlen1 = 0, pathlen2 = 0;
2019-04-15 12:00:42 -04:00
	bool freepath1 = false, freepath2 = false;
2023-04-26 10:38:57 +08:00
	struct dentry *old_dentry = NULL;
2020-12-16 17:19:58 +01:00
	int len;
2009-10-06 11:31:09 -07:00
	u16 releases;
	void *p, *end;
	int ret;
2020-12-09 10:12:59 -05:00
	bool legacy = !(session->s_con.peer_features & CEPH_FEATURE_FS_BTIME);
2023-08-07 15:26:17 +02:00
	u16 request_head_version = mds_supported_head_version(session);
2023-08-07 15:26:18 +02:00
	kuid_t caller_fsuid = req->r_cred->fsuid;
	kgid_t caller_fsgid = req->r_cred->fsgid;
2009-10-06 11:31:09 -07:00
2023-06-09 15:15:47 +08:00
	ret = set_request_path_attr(mdsc, req->r_inode, req->r_dentry,
2017-01-31 10:28:26 -05:00
			req->r_parent, req->r_path1, req->r_ino1.ino,
2019-04-15 12:00:42 -04:00
			&path1, &pathlen1, &ino1, &freepath1,
			test_bit(CEPH_MDS_R_PARENT_LOCKED,
				 &req->r_req_flags));
2009-10-06 11:31:09 -07:00
	if (ret < 0) {
		msg = ERR_PTR(ret);
		goto out;
	}
2019-04-15 12:00:42 -04:00
	/* If r_old_dentry is set, then assume that its parent is locked */
2023-04-26 10:38:57 +08:00
	if (req->r_old_dentry &&
	    !(req->r_old_dentry->d_flags & DCACHE_DISCONNECTED))
		old_dentry = req->r_old_dentry;
2023-06-09 15:15:47 +08:00
	ret = set_request_path_attr(mdsc, NULL, old_dentry,
2016-12-15 08:37:58 -05:00
			req->r_old_dentry_dir,
2009-10-06 11:31:09 -07:00
			req->r_path2, req->r_ino2.ino,
2019-04-15 12:00:42 -04:00
			&path2, &pathlen2, &ino2, &freepath2, true);
2009-10-06 11:31:09 -07:00
	if (ret < 0) {
		msg = ERR_PTR(ret);
		goto out_free1;
	}
2021-01-14 10:39:22 -05:00
	req->r_altname = get_fscrypt_altname(req, &req->r_altname_len);
	if (IS_ERR(req->r_altname)) {
		msg = ERR_CAST(req->r_altname);
		req->r_altname = NULL;
		goto out_free2;
	}
2023-07-25 17:51:59 +08:00
	/*
	 * Old MDSes that do not support the 32bit retry/fwd feature
	 * copy the raw memory directly when decoding requests, while
	 * newer MDSes decode the head depending on the version member,
	 * so we need to make sure the encoding is compatible with both.
	 */
	if (legacy)
		len = sizeof(struct ceph_mds_request_head_legacy);
2023-08-07 15:26:17 +02:00
	else if (request_head_version == 1)
2023-07-25 17:51:59 +08:00
		len = sizeof(struct ceph_mds_request_head_old);
2023-08-07 15:26:17 +02:00
	else if (request_head_version == 2)
		len = offsetofend(struct ceph_mds_request_head, ext_num_fwd);
2023-07-25 17:51:59 +08:00
	else
		len = sizeof(struct ceph_mds_request_head);
2009-10-06 11:31:09 -07:00
2020-07-27 10:16:09 -04:00
	/* filepaths */
	len += 2 * (1 + sizeof(u32) + sizeof(u64));
	len += pathlen1 + pathlen2;
	/* cap releases */
2009-10-06 11:31:09 -07:00
	len += sizeof(struct ceph_mds_request_release) *
		(!!req->r_inode_drop + !!req->r_dentry_drop +
		 !!req->r_old_inode_drop + !!req->r_old_dentry_drop);
2020-12-09 10:12:59 -05:00
2009-10-06 11:31:09 -07:00
	if (req->r_dentry_drop)
2019-04-17 14:23:17 -04:00
		len += pathlen1;
2009-10-06 11:31:09 -07:00
	if (req->r_old_dentry_drop)
2019-04-17 14:23:17 -04:00
		len += pathlen2;
2009-10-06 11:31:09 -07:00
2020-07-27 10:16:09 -04:00
	/* MClientRequest tail */
	/* req->r_stamp */
	len += sizeof(struct ceph_timespec);
	/* gid list */
	len += sizeof(u32) + (sizeof(u64) * req->r_cred->group_info->ngroups);
	/* alternate name */
2021-01-14 10:39:22 -05:00
	len += sizeof(u32) + req->r_altname_len;
2020-07-27 10:16:09 -04:00
	/* fscrypt_auth */
	len += sizeof(u32); // fscrypt_auth
	if (req->r_fscrypt_auth)
		len += ceph_fscrypt_auth_len(req->r_fscrypt_auth);
	/* fscrypt_file */
	len += sizeof(u32);
2022-08-25 09:31:06 -04:00
	if (test_bit(CEPH_MDS_R_FSCRYPT_FILE, &req->r_req_flags))
		len += sizeof(__le64);
2020-07-27 10:16:09 -04:00
2018-10-15 17:38:23 +02:00
	msg = ceph_msg_new2(CEPH_MSG_CLIENT_REQUEST, len, 1, GFP_NOFS, false);
2010-04-01 16:06:19 -07:00
	if (!msg) {
		msg = ERR_PTR(-ENOMEM);
2009-10-06 11:31:09 -07:00
		goto out_free2;
2010-04-01 16:06:19 -07:00
	}
2009-10-06 11:31:09 -07:00
2009-12-22 11:24:33 -08:00
	msg->hdr.tid = cpu_to_le64(req->r_tid);
2023-07-25 17:51:59 +08:00
	lhead = find_legacy_request_head(msg->front.iov_base,
					 session->s_con.peer_features);
2023-08-07 15:26:17 +02:00
	if ((req->r_mnt_idmap != &nop_mnt_idmap) &&
	    !test_bit(CEPHFS_FEATURE_HAS_OWNER_UIDGID, &session->s_features)) {
		WARN_ON_ONCE(!IS_CEPH_MDS_OP_NEWINODE(req->r_op));
2023-08-07 15:26:18 +02:00
		if (enable_unsafe_idmap) {
			pr_warn_once_client(cl,
				"idmapped mount is used and CEPHFS_FEATURE_HAS_OWNER_UIDGID"
				" is not supported by MDS. UID/GID-based restrictions may"
				" not work properly.\n");
2023-08-07 15:26:17 +02:00
2023-08-07 15:26:18 +02:00
			caller_fsuid = from_vfsuid(req->r_mnt_idmap, &init_user_ns,
						   VFSUIDT_INIT(req->r_cred->fsuid));
			caller_fsgid = from_vfsgid(req->r_mnt_idmap, &init_user_ns,
						   VFSGIDT_INIT(req->r_cred->fsgid));
		} else {
			pr_err_ratelimited_client(cl,
				"idmapped mount is used and CEPHFS_FEATURE_HAS_OWNER_UIDGID"
				" is not supported by MDS. Fail request with -EIO.\n");
			ret = -EIO;
			goto out_err;
		}
2023-08-07 15:26:17 +02:00
	}
2020-12-09 10:12:59 -05:00
	/*
2023-07-25 17:51:59 +08:00
	 * The ceph_mds_request_head_legacy didn't contain a version field, and
2020-12-09 10:12:59 -05:00
	 * one was added when we moved the message version from 3->4.
	 */
	if (legacy) {
		msg->hdr.version = cpu_to_le16(3);
2023-07-25 17:51:59 +08:00
		p = msg->front.iov_base + sizeof(*lhead);
2023-08-07 15:26:17 +02:00
	} else if (request_head_version == 1) {
2023-07-25 17:51:59 +08:00
		struct ceph_mds_request_head_old *ohead = msg->front.iov_base;

		msg->hdr.version = cpu_to_le16(4);
		ohead->version = cpu_to_le16(1);
		p = msg->front.iov_base + sizeof(*ohead);
2023-08-07 15:26:17 +02:00
	} else if (request_head_version == 2) {
		struct ceph_mds_request_head *nhead = msg->front.iov_base;

		msg->hdr.version = cpu_to_le16(6);
		nhead->version = cpu_to_le16(2);
		p = msg->front.iov_base + offsetofend(struct ceph_mds_request_head, ext_num_fwd);
2020-12-09 10:12:59 -05:00
	} else {
2023-07-25 17:51:59 +08:00
		struct ceph_mds_request_head *nhead = msg->front.iov_base;
2023-08-07 15:26:17 +02:00
		kuid_t owner_fsuid;
		kgid_t owner_fsgid;
2020-12-09 10:12:59 -05:00
2020-07-27 10:16:09 -04:00
		msg->hdr.version = cpu_to_le16(6);
2023-07-25 17:51:59 +08:00
		nhead->version = cpu_to_le16(CEPH_MDS_REQUEST_HEAD_VERSION);
2023-08-07 15:26:17 +02:00
		nhead->struct_len = cpu_to_le32(sizeof(struct ceph_mds_request_head));
		if (IS_CEPH_MDS_OP_NEWINODE(req->r_op)) {
			owner_fsuid = from_vfsuid(req->r_mnt_idmap, &init_user_ns,
						  VFSUIDT_INIT(req->r_cred->fsuid));
			owner_fsgid = from_vfsgid(req->r_mnt_idmap, &init_user_ns,
						  VFSGIDT_INIT(req->r_cred->fsgid));
			nhead->owner_uid = cpu_to_le32(from_kuid(&init_user_ns, owner_fsuid));
			nhead->owner_gid = cpu_to_le32(from_kgid(&init_user_ns, owner_fsgid));
		} else {
			nhead->owner_uid = cpu_to_le32(-1);
			nhead->owner_gid = cpu_to_le32(-1);
		}
2023-07-25 17:51:59 +08:00
		p = msg->front.iov_base + sizeof(*nhead);
2020-12-09 10:12:59 -05:00
	}
2009-10-06 11:31:09 -07:00
	end = msg->front.iov_base + msg->front.iov_len;
2023-07-25 17:51:59 +08:00
	lhead->mdsmap_epoch = cpu_to_le32(mdsc->mdsmap->m_epoch);
	lhead->op = cpu_to_le32(req->r_op);
	lhead->caller_uid = cpu_to_le32(from_kuid(&init_user_ns,
2023-08-07 15:26:18 +02:00
						  caller_fsuid));
2023-07-25 17:51:59 +08:00
	lhead->caller_gid = cpu_to_le32(from_kgid(&init_user_ns,
2023-08-07 15:26:18 +02:00
						  caller_fsgid));
2023-07-25 17:51:59 +08:00
	lhead->ino = cpu_to_le64(req->r_deleg_ino);
	lhead->args = req->r_args;
2009-10-06 11:31:09 -07:00
	ceph_encode_filepath(&p, end, ino1, path1);
	ceph_encode_filepath(&p, end, ino2, path2);
2010-07-15 14:58:39 -07:00
	/* make note of release offset, in case we need to replay */
	req->r_request_release_offset = p - msg->front.iov_base;
2009-10-06 11:31:09 -07:00
	/* cap releases */
	releases = 0;
	if (req->r_inode_drop)
		releases += ceph_encode_inode_release(&p,
2015-03-17 22:25:59 +00:00
		      req->r_inode ? req->r_inode : d_inode(req->r_dentry),
2020-03-05 20:21:00 +08:00
		      mds, req->r_inode_drop, req->r_inode_unless,
		      req->r_op == CEPH_MDS_OP_READDIR);
2020-08-07 09:28:31 -04:00
	if (req->r_dentry_drop) {
		ret = ceph_encode_dentry_release(&p, req->r_dentry,
2017-01-31 10:28:26 -05:00
				req->r_parent, mds, req->r_dentry_drop,
2016-12-15 08:37:59 -05:00
				req->r_dentry_unless);
2020-08-07 09:28:31 -04:00
		if (ret < 0)
			goto out_err;
		releases += ret;
	}
	if (req->r_old_dentry_drop) {
		ret = ceph_encode_dentry_release(&p, req->r_old_dentry,
2016-12-15 08:37:59 -05:00
				req->r_old_dentry_dir, mds,
				req->r_old_dentry_drop,
				req->r_old_dentry_unless);
2020-08-07 09:28:31 -04:00
		if (ret < 0)
			goto out_err;
		releases += ret;
	}
2009-10-06 11:31:09 -07:00
	if (req->r_old_inode_drop)
		releases += ceph_encode_inode_release(&p,
2015-03-17 22:25:59 +00:00
		      d_inode(req->r_old_dentry),
2009-10-06 11:31:09 -07:00
		      mds, req->r_old_inode_drop, req->r_old_inode_unless, 0);
2015-02-27 08:54:08 +08:00
	if (drop_cap_releases) {
		releases = 0;
		p = msg->front.iov_base + req->r_request_release_offset;
	}
2023-07-25 17:51:59 +08:00
	lhead->num_releases = cpu_to_le16(releases);
2009-10-06 11:31:09 -07:00
2020-07-27 10:16:09 -04:00
encode_mclientrequest_tail ( & p , req ) ;
2020-12-09 10:12:59 -05:00
2020-06-30 03:52:18 -04:00
if ( WARN_ON_ONCE ( p > end ) ) {
ceph_msg_put ( msg ) ;
msg = ERR_PTR ( - ERANGE ) ;
goto out_free2 ;
}
2009-10-06 11:31:09 -07:00
msg - > front . iov_len = p - msg - > front . iov_base ;
msg - > hdr . front_len = cpu_to_le32 ( msg - > front . iov_len ) ;
2014-09-16 19:15:28 +08:00
if ( req - > r_pagelist ) {
struct ceph_pagelist * pagelist = req - > r_pagelist ;
ceph_msg_data_add_pagelist ( msg , pagelist ) ;
msg - > hdr . data_len = cpu_to_le32 ( pagelist - > length ) ;
} else {
msg - > hdr . data_len = 0 ;
2013-03-04 22:29:57 -06:00
}
2013-02-14 12:16:43 -06:00
2009-10-06 11:31:09 -07:00
msg - > hdr . data_off = cpu_to_le16 ( 0 ) ;
out_free2 :
if ( freepath2 )
2019-04-29 12:13:14 -04:00
ceph_mdsc_free_path ( ( char * ) path2 , pathlen2 ) ;
2009-10-06 11:31:09 -07:00
out_free1 :
if ( freepath1 )
2019-04-29 12:13:14 -04:00
ceph_mdsc_free_path ( ( char * ) path1 , pathlen1 ) ;
2009-10-06 11:31:09 -07:00
out :
return msg ;
2020-08-07 09:28:31 -04:00
out_err :
ceph_msg_put ( msg ) ;
msg = ERR_PTR ( ret ) ;
goto out_free2 ;
2009-10-06 11:31:09 -07:00
}
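The tail of the function above records the offset where cap releases begin, encodes them, and rewinds to that offset when releases must be dropped. A minimal userspace sketch of that cursor pattern follows. All `demo_*` names are hypothetical and not part of the kernel API, and the payload values are arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Append v in little-endian byte order; returns 0, or -1 if it would pass end. */
static int demo_encode_le32(unsigned char **p, const unsigned char *end,
			    uint32_t v)
{
	if ((size_t)(end - *p) < 4)
		return -1;
	(*p)[0] = v & 0xff;
	(*p)[1] = (v >> 8) & 0xff;
	(*p)[2] = (v >> 16) & 0xff;
	(*p)[3] = (v >> 24) & 0xff;
	*p += 4;
	return 0;
}

/*
 * Encode a fixed part followed by one "release" record. The offset of the
 * release record is remembered so the cursor can be rewound to it later.
 * Returns the final front length, or 0 if the buffer is too small.
 */
static size_t demo_build_front(unsigned char *base, size_t cap,
			       int drop_releases, size_t *release_offset)
{
	unsigned char *p = base;
	const unsigned char *end = base + cap;

	assert(base && release_offset);

	if (demo_encode_le32(&p, end, 0x11223344u) < 0)	/* fixed part */
		return 0;
	*release_offset = (size_t)(p - base);		/* note release offset */
	if (demo_encode_le32(&p, end, 0xaabbccddu) < 0)	/* release record */
		return 0;
	if (drop_releases)
		p = base + *release_offset;		/* rewind over releases */
	return (size_t)(p - base);
}
```

Keeping the offset rather than a pointer means the same value stays valid if the message buffer is reused later, which is how the replay path in this file uses `r_request_release_offset`.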
/*
* called under mdsc - > mutex if error , under no mutex if
* success .
*/
static void complete_request ( struct ceph_mds_client * mdsc ,
struct ceph_mds_request * req )
{
2020-03-19 23:45:02 -04:00
req - > r_end_latency = ktime_get ( ) ;
2009-10-06 11:31:09 -07:00
if ( req - > r_callback )
req - > r_callback ( mdsc , req ) ;
2019-04-02 09:43:18 -04:00
complete_all ( & req - > r_completion ) ;
2009-10-06 11:31:09 -07:00
}
/*
* called under mdsc - > mutex
*/
2020-12-09 08:24:18 -05:00
static int __prepare_send_request ( struct ceph_mds_session * session ,
2009-10-06 11:31:09 -07:00
struct ceph_mds_request * req ,
2020-12-09 08:24:18 -05:00
bool drop_cap_releases )
2009-10-06 11:31:09 -07:00
{
2020-12-09 08:24:18 -05:00
int mds = session - > s_mds ;
struct ceph_mds_client * mdsc = session - > s_mdsc ;
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = mdsc - > fsc - > client ;
2023-07-25 17:51:59 +08:00
struct ceph_mds_request_head_legacy * lhead ;
struct ceph_mds_request_head * nhead ;
2009-10-06 11:31:09 -07:00
struct ceph_msg * msg ;
2023-07-25 17:51:59 +08:00
int flags = 0 , old_max_retry ;
bool old_version = ! test_bit ( CEPHFS_FEATURE_32BITS_RETRY_FWD ,
& session - > s_features ) ;
2022-03-30 14:39:33 +08:00
/*
2023-07-25 17:51:59 +08:00
* Avoid infinite retrying after overflow. The client keeps
* increasing the retry count, and if the MDS is an old version
* we limit the request to at most 256 retries.
2022-03-30 14:39:33 +08:00
*/
2023-07-25 17:51:59 +08:00
if ( req - > r_attempts ) {
old_max_retry = sizeof_field ( struct ceph_mds_request_head_old ,
num_retry ) ;
old_max_retry = 1 < < ( old_max_retry * BITS_PER_BYTE ) ;
if ( ( old_version & & req - > r_attempts > = old_max_retry ) | |
( ( uint32_t ) req - > r_attempts > = U32_MAX ) ) {
2023-06-12 09:04:07 +08:00
pr_warn_ratelimited_client ( cl , " request tid %llu seq overflow \n " ,
req - > r_tid ) ;
2023-07-25 17:51:59 +08:00
return - EMULTIHOP ;
}
2022-03-30 14:39:33 +08:00
}
2009-10-06 11:31:09 -07:00
req - > r_attempts + + ;
2010-06-22 15:58:01 -07:00
if ( req - > r_inode ) {
struct ceph_cap * cap =
ceph_get_cap_for_mds ( ceph_inode ( req - > r_inode ) , mds ) ;
if ( cap )
req - > r_sent_on_mseq = cap - > mseq ;
else
req - > r_sent_on_mseq = - 1 ;
}
2023-06-12 09:04:07 +08:00
doutc ( cl , " %p tid %lld %s (attempt %d) \n " , req , req - > r_tid ,
ceph_mds_op_name ( req - > r_op ) , req - > r_attempts ) ;
2009-10-06 11:31:09 -07:00
2017-02-01 13:49:09 -05:00
if ( test_bit ( CEPH_MDS_R_GOT_UNSAFE , & req - > r_req_flags ) ) {
2014-07-01 16:54:34 +08:00
void * p ;
2020-12-09 10:12:59 -05:00
2010-07-15 13:24:32 -07:00
/*
* Replay . Do not regenerate message ( and rebuild
* paths , etc . ) ; just use the original message .
* Rebuilding paths will break for renames because
* d_move mangles the src name .
*/
msg = req - > r_request ;
2023-07-25 17:51:59 +08:00
lhead = find_legacy_request_head ( msg - > front . iov_base ,
session - > s_con . peer_features ) ;
2010-07-15 13:24:32 -07:00
2023-07-25 17:51:59 +08:00
flags = le32_to_cpu ( lhead - > flags ) ;
2010-07-15 13:24:32 -07:00
flags | = CEPH_MDS_FLAG_REPLAY ;
2023-07-25 17:51:59 +08:00
lhead - > flags = cpu_to_le32 ( flags ) ;
2010-07-15 13:24:32 -07:00
if ( req - > r_target_inode )
2023-07-25 17:51:59 +08:00
lhead - > ino = cpu_to_le64 ( ceph_ino ( req - > r_target_inode ) ) ;
2010-07-15 13:24:32 -07:00
2023-07-25 17:51:59 +08:00
lhead - > num_retry = req - > r_attempts - 1 ;
if ( ! old_version ) {
nhead = ( struct ceph_mds_request_head * ) msg - > front . iov_base ;
nhead - > ext_num_retry = cpu_to_le32 ( req - > r_attempts - 1 ) ;
}
2010-07-15 14:58:39 -07:00
/* remove cap/dentry releases from message */
2023-07-25 17:51:59 +08:00
lhead - > num_releases = 0 ;
2014-07-01 16:54:34 +08:00
p = msg - > front . iov_base + req - > r_request_release_offset ;
2020-07-27 10:16:09 -04:00
encode_mclientrequest_tail ( & p , req ) ;
2014-07-01 16:54:34 +08:00
msg - > front . iov_len = p - msg - > front . iov_base ;
msg - > hdr . front_len = cpu_to_le32 ( msg - > front . iov_len ) ;
2010-07-15 13:24:32 -07:00
return 0 ;
}
2009-10-06 11:31:09 -07:00
if ( req - > r_request ) {
ceph_msg_put ( req - > r_request ) ;
req - > r_request = NULL ;
}
2020-12-09 10:12:59 -05:00
msg = create_request_message ( session , req , drop_cap_releases ) ;
2009-10-06 11:31:09 -07:00
if ( IS_ERR ( msg ) ) {
2010-05-13 11:19:06 -07:00
req - > r_err = PTR_ERR ( msg ) ;
2010-04-01 16:06:19 -07:00
return PTR_ERR ( msg ) ;
2009-10-06 11:31:09 -07:00
}
req - > r_request = msg ;
2023-07-25 17:51:59 +08:00
lhead = find_legacy_request_head ( msg - > front . iov_base ,
session - > s_con . peer_features ) ;
lhead - > oldest_client_tid = cpu_to_le64 ( __get_oldest_tid ( mdsc ) ) ;
2017-02-01 13:49:09 -05:00
if ( test_bit ( CEPH_MDS_R_GOT_UNSAFE , & req - > r_req_flags ) )
2009-10-06 11:31:09 -07:00
flags | = CEPH_MDS_FLAG_REPLAY ;
2019-12-02 13:47:57 -05:00
if ( test_bit ( CEPH_MDS_R_ASYNC , & req - > r_req_flags ) )
flags | = CEPH_MDS_FLAG_ASYNC ;
2017-01-31 10:28:26 -05:00
if ( req - > r_parent )
2009-10-06 11:31:09 -07:00
flags | = CEPH_MDS_FLAG_WANT_DENTRY ;
2023-07-25 17:51:59 +08:00
lhead - > flags = cpu_to_le32 ( flags ) ;
lhead - > num_fwd = req - > r_num_fwd ;
lhead - > num_retry = req - > r_attempts - 1 ;
if ( ! old_version ) {
nhead = ( struct ceph_mds_request_head * ) msg - > front . iov_base ;
nhead - > ext_num_fwd = cpu_to_le32 ( req - > r_num_fwd ) ;
nhead - > ext_num_retry = cpu_to_le32 ( req - > r_attempts - 1 ) ;
}
2009-10-06 11:31:09 -07:00
2023-06-12 09:04:07 +08:00
doutc ( cl , " r_parent = %p \n " , req - > r_parent ) ;
2009-10-06 11:31:09 -07:00
return 0 ;
}
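The retry limit checked at the top of the function above is derived from the width of the legacy `num_retry` field. The sketch below reproduces that arithmetic in userspace. It assumes the legacy field is one byte wide, which is consistent with the "256 times" in the comment. The `demo_*` names and the stand-in struct are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the old wire header; only the field width matters. */
struct demo_request_head_old {
	uint8_t num_retry;
	uint8_t num_fwd;
};

#define DEMO_BITS_PER_BYTE 8
#define demo_sizeof_field(type, member) (sizeof(((type *)0)->member))

/* Number of distinct values the legacy num_retry field can hold. */
static unsigned int demo_old_max_retry(void)
{
	size_t bytes = demo_sizeof_field(struct demo_request_head_old,
					 num_retry);

	assert(bytes < sizeof(unsigned int));	/* keep the shift well defined */
	return 1u << (bytes * DEMO_BITS_PER_BYTE);
}

/* True when one more attempt could no longer be encoded for this peer. */
static bool demo_retry_overflow(uint32_t attempts, bool old_version)
{
	if (!attempts)
		return false;	/* the first send can never overflow */
	return (old_version && attempts >= demo_old_max_retry()) ||
	       attempts >= UINT32_MAX;
}
```

With a one-byte field the old-version limit is 256 attempts, while peers that understand the 32-bit extension are only stopped at `UINT32_MAX`.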
2019-12-05 20:50:21 -05:00
/*
* called under mdsc - > mutex
*/
2020-12-09 08:24:18 -05:00
static int __send_request ( struct ceph_mds_session * session ,
2019-12-05 20:50:21 -05:00
struct ceph_mds_request * req ,
bool drop_cap_releases )
{
int err ;
2020-12-09 08:24:18 -05:00
err = __prepare_send_request ( session , req , drop_cap_releases ) ;
2019-12-05 20:50:21 -05:00
if ( ! err ) {
ceph_msg_get ( req - > r_request ) ;
ceph_con_send ( & session - > s_con , req - > r_request ) ;
}
return err ;
}
2009-10-06 11:31:09 -07:00
/*
* send request , or put it on the appropriate wait list .
*/
2018-07-28 16:30:48 +08:00
static void __do_request ( struct ceph_mds_client * mdsc ,
2009-10-06 11:31:09 -07:00
struct ceph_mds_request * req )
{
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = mdsc - > fsc - > client ;
2009-10-06 11:31:09 -07:00
struct ceph_mds_session * session = NULL ;
int mds = - 1 ;
2015-07-01 16:27:46 +08:00
int err = 0 ;
2019-12-09 07:47:15 -05:00
bool random ;
2009-10-06 11:31:09 -07:00
2017-02-01 13:49:09 -05:00
if ( req - > r_err | | test_bit ( CEPH_MDS_R_GOT_RESULT , & req - > r_req_flags ) ) {
if ( test_bit ( CEPH_MDS_R_ABORTED , & req - > r_req_flags ) )
2013-09-26 14:25:36 +08:00
__unregister_request ( mdsc , req ) ;
2018-07-28 16:30:48 +08:00
return ;
2013-09-26 14:25:36 +08:00
}
2009-10-06 11:31:09 -07:00
2023-02-01 09:36:45 +08:00
if ( READ_ONCE ( mdsc - > fsc - > mount_state ) = = CEPH_MOUNT_FENCE_IO ) {
2023-06-12 09:04:07 +08:00
doutc ( cl , " metadata corrupted \n " ) ;
2023-02-01 09:36:45 +08:00
err = - EIO ;
goto finish ;
}
2009-10-06 11:31:09 -07:00
if ( req - > r_timeout & &
time_after_eq ( jiffies , req - > r_started + req - > r_timeout ) ) {
2023-06-12 09:04:07 +08:00
doutc ( cl , " timed out \n " ) ;
2020-02-23 22:23:11 -05:00
err = - ETIMEDOUT ;
2009-10-06 11:31:09 -07:00
goto finish ;
}
2016-12-26 10:26:34 +01:00
if ( READ_ONCE ( mdsc - > fsc - > mount_state ) = = CEPH_MOUNT_SHUTDOWN ) {
2023-06-12 09:04:07 +08:00
doutc ( cl , " forced umount \n " ) ;
2015-07-01 16:27:46 +08:00
err = - EIO ;
goto finish ;
}
2016-12-26 10:26:34 +01:00
if ( READ_ONCE ( mdsc - > fsc - > mount_state ) = = CEPH_MOUNT_MOUNTING ) {
2016-11-10 16:02:06 +08:00
if ( mdsc - > mdsmap_err ) {
err = mdsc - > mdsmap_err ;
2023-06-12 09:04:07 +08:00
doutc ( cl , " mdsmap err %d \n " , err ) ;
2016-11-10 16:02:06 +08:00
goto finish ;
}
2017-01-04 16:21:58 +08:00
if ( mdsc - > mdsmap - > m_epoch = = 0 ) {
2023-06-12 09:04:07 +08:00
doutc ( cl , " no mdsmap, waiting for map \n " ) ;
2017-01-04 16:21:58 +08:00
list_add ( & req - > r_wait , & mdsc - > waiting_for_map ) ;
2018-07-28 16:30:48 +08:00
return ;
2017-01-04 16:21:58 +08:00
}
2016-11-10 16:02:06 +08:00
if ( ! ( mdsc - > fsc - > mount_options - > flags &
CEPH_MOUNT_OPT_MOUNTWAIT ) & &
! ceph_mdsmap_is_cluster_available ( mdsc - > mdsmap ) ) {
2019-12-10 20:29:40 -05:00
err = - EHOSTUNREACH ;
2016-11-10 16:02:06 +08:00
goto finish ;
}
}
2009-10-06 11:31:09 -07:00
2010-11-02 13:49:00 -07:00
put_request_session ( req ) ;
2019-12-09 07:47:15 -05:00
mds = __choose_mds ( mdsc , req , & random ) ;
2009-10-06 11:31:09 -07:00
if ( mds < 0 | |
ceph_mdsmap_get_state ( mdsc - > mdsmap , mds ) < CEPH_MDS_STATE_ACTIVE ) {
2019-12-02 13:47:57 -05:00
if ( test_bit ( CEPH_MDS_R_ASYNC , & req - > r_req_flags ) ) {
err = - EJUKEBOX ;
goto finish ;
}
2023-06-12 09:04:07 +08:00
doutc ( cl , " no mds or not active, waiting for map \n " ) ;
2009-10-06 11:31:09 -07:00
list_add ( & req - > r_wait , & mdsc - > waiting_for_map ) ;
2018-07-28 16:30:48 +08:00
return ;
2009-10-06 11:31:09 -07:00
}
/* get, open session */
session = __ceph_lookup_mds_session ( mdsc , mds ) ;
2010-03-20 20:43:28 -07:00
if ( ! session ) {
2009-10-06 11:31:09 -07:00
session = register_session ( mdsc , mds ) ;
2010-03-20 20:43:28 -07:00
if ( IS_ERR ( session ) ) {
err = PTR_ERR ( session ) ;
goto finish ;
}
}
2019-12-19 19:44:09 -05:00
req - > r_session = ceph_get_mds_session ( session ) ;
2010-11-02 13:49:00 -07:00
2023-06-12 09:04:07 +08:00
doutc ( cl , " mds%d session %p state %s \n " , mds , session ,
ceph_session_state_name ( session - > s_state ) ) ;
2022-07-27 12:29:10 +08:00
/*
* Old Ceph MDSs will crash when they see unknown ops
*/
if ( req - > r_feature_needed > 0 & &
! test_bit ( req - > r_feature_needed , & session - > s_features ) ) {
err = - EOPNOTSUPP ;
goto out_session ;
}
2009-10-06 11:31:09 -07:00
if ( session - > s_state ! = CEPH_MDS_SESSION_OPEN & &
session - > s_state ! = CEPH_MDS_SESSION_HUNG ) {
2019-12-02 13:47:57 -05:00
/*
* We cannot queue async requests since the caps and delegated
* inodes are bound to the session . Just return - EJUKEBOX and
* let the caller retry a sync request in that case .
*/
if ( test_bit ( CEPH_MDS_R_ASYNC , & req - > r_req_flags ) ) {
err = - EJUKEBOX ;
goto out_session ;
}
2020-09-21 13:12:53 -04:00
/*
* If the session has been REJECTED , then return a hard error ,
* unless it ' s a CLEANRECOVER mount , in which case we ' ll queue
* it to the mdsc queue .
*/
if ( session - > s_state = = CEPH_MDS_SESSION_REJECTED ) {
if ( ceph_test_mount_opt ( mdsc - > fsc , CLEANRECOVER ) )
list_add ( & req - > r_wait , & mdsc - > waiting_for_map ) ;
else
err = - EACCES ;
goto out_session ;
}
2009-10-06 11:31:09 -07:00
if ( session - > s_state = = CEPH_MDS_SESSION_NEW | |
2019-12-09 07:47:15 -05:00
session - > s_state = = CEPH_MDS_SESSION_CLOSING ) {
2020-06-30 03:52:18 -04:00
err = __open_session ( mdsc , session ) ;
if ( err )
goto out_session ;
2019-12-09 07:47:15 -05:00
/* retry the same mds later */
if ( random )
req - > r_resend_mds = mds ;
}
2009-10-06 11:31:09 -07:00
list_add ( & req - > r_wait , & session - > s_waiting ) ;
goto out_session ;
}
/* send request */
req - > r_resend_mds = - 1 ; /* forget any previous mds hint */
if ( req - > r_request_started = = 0 ) /* note request start time */
req - > r_request_started = jiffies ;
2022-06-10 09:53:21 +08:00
/*
* For async create we choose the auth MDS of the frag in the parent
* directory to send the request to, and usually this works fine. But
* if the directory is migrated to another MDS before that MDS can
* handle the request, the request will be forwarded.
*
* And then the auth cap will be changed .
*/
if ( test_bit ( CEPH_MDS_R_ASYNC , & req - > r_req_flags ) & & req - > r_num_fwd ) {
struct ceph_dentry_info * di = ceph_dentry ( req - > r_dentry ) ;
struct ceph_inode_info * ci ;
struct ceph_cap * cap ;
/*
* The request may be handled very fast, before the new inode
* has been linked to the dentry. We need to wait for
* ceph_finish_async_create(), which in theory shouldn't be
* stuck for long or fail, to finish when forwarding
* the request.
*/
if ( ! d_inode ( req - > r_dentry ) ) {
err = wait_on_bit ( & di - > flags , CEPH_DENTRY_ASYNC_CREATE_BIT ,
TASK_KILLABLE ) ;
if ( err ) {
mutex_lock ( & req - > r_fill_mutex ) ;
set_bit ( CEPH_MDS_R_ABORTED , & req - > r_req_flags ) ;
mutex_unlock ( & req - > r_fill_mutex ) ;
goto out_session ;
}
}
ci = ceph_inode ( d_inode ( req - > r_dentry ) ) ;
spin_lock ( & ci - > i_ceph_lock ) ;
cap = ci - > i_auth_cap ;
if ( ci - > i_ceph_flags & CEPH_I_ASYNC_CREATE & & mds ! = cap - > mds ) {
2023-06-12 09:04:07 +08:00
doutc ( cl , " session changed for auth cap %d -> %d \n " ,
cap - > session - > s_mds , session - > s_mds ) ;
2022-06-10 09:53:21 +08:00
/* Remove the auth cap from old session */
spin_lock ( & cap - > session - > s_cap_lock ) ;
cap - > session - > s_nr_caps - - ;
list_del_init ( & cap - > session_caps ) ;
spin_unlock ( & cap - > session - > s_cap_lock ) ;
/* Add the auth cap to the new session */
cap - > mds = mds ;
cap - > session = session ;
spin_lock ( & session - > s_cap_lock ) ;
session - > s_nr_caps + + ;
list_add_tail ( & cap - > session_caps , & session - > s_caps ) ;
spin_unlock ( & session - > s_cap_lock ) ;
change_auth_cap_ses ( ci , session ) ;
}
spin_unlock ( & ci - > i_ceph_lock ) ;
}
2020-12-09 08:24:18 -05:00
err = __send_request ( session , req , false ) ;
2009-10-06 11:31:09 -07:00
out_session :
ceph_put_mds_session ( session ) ;
2015-07-01 16:27:46 +08:00
finish :
if ( err ) {
2023-06-12 09:04:07 +08:00
doutc ( cl , " early error %d \n " , err ) ;
2015-07-01 16:27:46 +08:00
req - > r_err = err ;
complete_request ( mdsc , req ) ;
__unregister_request ( mdsc , req ) ;
}
2018-07-28 16:30:48 +08:00
return ;
2009-10-06 11:31:09 -07:00
}
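The timeout test in the function above uses `time_after_eq()` so that the comparison stays correct when the tick counter wraps. The sketch below shows the same idea in userspace with a 32-bit counter. The kernel's `jiffies` is an `unsigned long`, so the fixed width is an assumption made for illustration, and the `demo_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True if tick a is at or after tick b, even if the counter wrapped in
 * between. Relies on the usual two's-complement conversion to int32_t.
 */
static bool demo_time_after_eq(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) >= 0;
}

/* A timeout of 0 means "no timeout", as with r_timeout above. */
static bool demo_request_timed_out(uint32_t now, uint32_t started,
				   uint32_t timeout)
{
	/* The comparison is only meaningful within half the counter range. */
	assert(timeout <= INT32_MAX);
	return timeout && demo_time_after_eq(now, started + timeout);
}
```

A plain `now >= started + timeout` would report a timeout immediately whenever the deadline wraps past zero while `now` has not, which is the case the signed-difference form avoids.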
/*
* called under mdsc - > mutex
*/
static void __wake_requests ( struct ceph_mds_client * mdsc ,
struct list_head * head )
{
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = mdsc - > fsc - > client ;
2012-11-19 10:49:06 +08:00
struct ceph_mds_request * req ;
LIST_HEAD ( tmp_list ) ;
list_splice_init ( head , & tmp_list ) ;
2009-10-06 11:31:09 -07:00
2012-11-19 10:49:06 +08:00
while ( ! list_empty ( & tmp_list ) ) {
req = list_entry ( tmp_list . next ,
struct ceph_mds_request , r_wait ) ;
2009-10-06 11:31:09 -07:00
list_del_init ( & req - > r_wait ) ;
2023-06-12 09:04:07 +08:00
doutc ( cl , " wake request %p tid %llu \n " , req ,
req - > r_tid ) ;
2009-10-06 11:31:09 -07:00
__do_request ( mdsc , req ) ;
}
}
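The function above first splices the wait list onto a private list and only then drains it. This matters because `__do_request()` may put a request straight back onto the same wait list. The userspace sketch below shows why the private list makes the loop terminate. The minimal list helpers and all `demo_*` names are hypothetical stand-ins for the kernel's `list_head` API.

```c
#include <assert.h>
#include <stddef.h>

struct demo_list {
	struct demo_list *next, *prev;
};

static void demo_list_init(struct demo_list *h)
{
	h->next = h->prev = h;
}

static int demo_list_empty(const struct demo_list *h)
{
	return h->next == h;
}

/* Insert n right after head h. */
static void demo_list_add(struct demo_list *n, struct demo_list *h)
{
	n->next = h->next;
	n->prev = h;
	h->next->prev = n;
	h->next = n;
}

static void demo_list_del_init(struct demo_list *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	demo_list_init(n);
}

/* Move every entry of src onto the empty list dst, leaving src empty. */
static void demo_list_splice_init(struct demo_list *src, struct demo_list *dst)
{
	assert(demo_list_empty(dst));
	if (demo_list_empty(src))
		return;
	dst->next = src->next;
	dst->prev = src->prev;
	dst->next->prev = dst;
	dst->prev->next = dst;
	demo_list_init(src);
}

struct demo_req {
	struct demo_list wait;
	int wakeups;
};

/*
 * Drain the waiters on head. Each one is re-queued onto head, modelling the
 * worst case where the request has to wait again. Returns the number woken.
 */
static int demo_wake_requests(struct demo_list *head)
{
	struct demo_list tmp;
	int woken = 0;

	demo_list_init(&tmp);
	demo_list_splice_init(head, &tmp);

	while (!demo_list_empty(&tmp)) {
		struct demo_req *req = (struct demo_req *)
			((char *)tmp.next - offsetof(struct demo_req, wait));

		demo_list_del_init(&req->wait);
		req->wakeups++;
		demo_list_add(&req->wait, head);	/* re-queued by handler */
		woken++;
	}
	return woken;
}
```

Looping directly on `head` in this scenario would never finish, since every processed entry reappears on the list being drained.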
/*
* Wake up threads with requests pending for @ mds , so that they can
2010-03-18 14:45:05 -07:00
* resubmit their requests to a possibly different mds .
2009-10-06 11:31:09 -07:00
*/
2010-03-18 14:45:05 -07:00
static void kick_requests ( struct ceph_mds_client * mdsc , int mds )
2009-10-06 11:31:09 -07:00
{
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = mdsc - > fsc - > client ;
2010-02-15 12:08:46 -08:00
struct ceph_mds_request * req ;
2014-07-30 10:12:47 +08:00
struct rb_node * p = rb_first ( & mdsc - > request_tree ) ;
2009-10-06 11:31:09 -07:00
2023-06-12 09:04:07 +08:00
doutc ( cl , " kick_requests mds%d \n " , mds ) ;
2014-07-30 10:12:47 +08:00
while ( p ) {
2010-02-15 12:08:46 -08:00
req = rb_entry ( p , struct ceph_mds_request , r_node ) ;
2014-07-30 10:12:47 +08:00
p = rb_next ( p ) ;
2017-02-01 13:49:09 -05:00
if ( test_bit ( CEPH_MDS_R_GOT_UNSAFE , & req - > r_req_flags ) )
2010-02-15 12:08:46 -08:00
continue ;
2015-02-04 14:26:22 +08:00
if ( req - > r_attempts > 0 )
continue ; /* only new requests */
2010-02-15 12:08:46 -08:00
if ( req - > r_session & &
req - > r_session - > s_mds = = mds ) {
2023-06-12 09:04:07 +08:00
doutc ( cl , " kicking tid %llu \n " , req - > r_tid ) ;
2014-09-11 14:28:56 +08:00
list_del_init ( & req - > r_wait ) ;
2010-02-15 12:08:46 -08:00
__do_request ( mdsc , req ) ;
2009-10-06 11:31:09 -07:00
}
}
}
2019-04-02 09:24:36 -04:00
int ceph_mdsc_submit_request ( struct ceph_mds_client * mdsc , struct inode * dir ,
2009-10-06 11:31:09 -07:00
struct ceph_mds_request * req )
{
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = mdsc - > fsc - > client ;
2020-01-14 15:06:40 -05:00
int err = 0 ;
2019-04-02 09:24:36 -04:00
/* take CAP_PIN refs for r_inode, r_parent, r_old_dentry */
if ( req - > r_inode )
ceph_get_cap_refs ( ceph_inode ( req - > r_inode ) , CEPH_CAP_PIN ) ;
2019-04-03 13:16:01 -04:00
if ( req - > r_parent ) {
2020-03-05 20:21:00 +08:00
struct ceph_inode_info * ci = ceph_inode ( req - > r_parent ) ;
int fmode = ( req - > r_op & CEPH_MDS_OP_WRITE ) ?
CEPH_FILE_MODE_WR : CEPH_FILE_MODE_RD ;
spin_lock ( & ci - > i_ceph_lock ) ;
ceph_take_cap_refs ( ci , CEPH_CAP_PIN , false ) ;
__ceph_touch_fmode ( ci , mdsc , fmode ) ;
spin_unlock ( & ci - > i_ceph_lock ) ;
2019-04-03 13:16:01 -04:00
}
2019-04-02 09:24:36 -04:00
if ( req - > r_old_dentry_dir )
ceph_get_cap_refs ( ceph_inode ( req - > r_old_dentry_dir ) ,
CEPH_CAP_PIN ) ;
2020-01-14 15:06:40 -05:00
if ( req - > r_inode ) {
err = ceph_wait_on_async_create ( req - > r_inode ) ;
if ( err ) {
2023-06-12 09:04:07 +08:00
doutc ( cl , " wait for async create returned: %d \n " , err ) ;
2020-01-14 15:06:40 -05:00
return err ;
}
}
if ( ! err & & req - > r_old_inode ) {
err = ceph_wait_on_async_create ( req - > r_old_inode ) ;
if ( err ) {
2023-06-12 09:04:07 +08:00
doutc ( cl , " wait for async create returned: %d \n " , err ) ;
2020-01-14 15:06:40 -05:00
return err ;
}
}
2023-06-12 09:04:07 +08:00
doutc ( cl , " submit_request on %p for inode %p \n " , req , dir ) ;
2009-10-06 11:31:09 -07:00
mutex_lock ( & mdsc - > mutex ) ;
2019-04-02 09:24:36 -04:00
__register_request ( mdsc , req , dir ) ;
2009-10-06 11:31:09 -07:00
__do_request ( mdsc , req ) ;
2019-04-02 09:24:36 -04:00
err = req - > r_err ;
2009-10-06 11:31:09 -07:00
mutex_unlock ( & mdsc - > mutex ) ;
2019-04-02 09:24:36 -04:00
return err ;
2009-10-06 11:31:09 -07:00
}
2022-02-03 09:04:24 -05:00
int ceph_mdsc_wait_request ( struct ceph_mds_client * mdsc ,
struct ceph_mds_request * req ,
ceph_mds_request_wait_callback_t wait_func )
2009-10-06 11:31:09 -07:00
{
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = mdsc - > fsc - > client ;
2009-10-06 11:31:09 -07:00
int err ;
2010-05-13 11:19:06 -07:00
/* wait */
2023-06-12 09:04:07 +08:00
doutc ( cl , " do_request waiting \n " ) ;
2022-02-03 09:04:24 -05:00
if ( wait_func ) {
err = wait_func ( mdsc , req ) ;
2010-05-13 11:19:06 -07:00
} else {
2015-05-19 12:05:38 +03:00
long timeleft = wait_for_completion_killable_timeout (
& req - > r_completion ,
ceph_timeout_jiffies ( req - > r_timeout ) ) ;
if ( timeleft > 0 )
err = 0 ;
else if ( ! timeleft )
2020-02-23 22:23:11 -05:00
err = - ETIMEDOUT ; /* timed out */
2015-05-19 12:05:38 +03:00
else
err = timeleft ; /* killed */
2010-05-13 11:19:06 -07:00
}
2023-06-12 09:04:07 +08:00
doutc ( cl , " do_request waited, got %d \n " , err ) ;
2010-05-13 11:19:06 -07:00
mutex_lock ( & mdsc - > mutex ) ;
2010-01-25 11:33:08 -08:00
2010-05-13 11:19:06 -07:00
/* only abort if we didn't race with a real reply */
2017-02-01 13:49:09 -05:00
if ( test_bit ( CEPH_MDS_R_GOT_RESULT , & req - > r_req_flags ) ) {
2010-05-13 11:19:06 -07:00
err = le32_to_cpu ( req - > r_reply_info . head - > result ) ;
} else if ( err < 0 ) {
2023-06-12 09:04:07 +08:00
doutc ( cl , " aborted request %lld with %d \n " , req - > r_tid , err ) ;
2010-05-13 12:01:13 -07:00
/*
* ensure we aren ' t running concurrently with
* ceph_fill_trace or ceph_readdir_prepopulate , which
* rely on locks ( dir mutex ) held by our caller .
*/
mutex_lock ( & req - > r_fill_mutex ) ;
2010-05-13 11:19:06 -07:00
req - > r_err = err ;
2017-02-01 13:49:09 -05:00
set_bit ( CEPH_MDS_R_ABORTED , & req - > r_req_flags ) ;
2010-05-13 12:01:13 -07:00
mutex_unlock ( & req - > r_fill_mutex ) ;
2010-01-25 11:33:08 -08:00
2017-01-31 10:28:26 -05:00
if ( req - > r_parent & &
2010-05-14 10:02:57 -07:00
( req - > r_op & CEPH_MDS_OP_WRITE ) )
ceph_invalidate_dir_request ( req ) ;
2009-10-06 11:31:09 -07:00
} else {
2010-05-13 11:19:06 -07:00
err = req - > r_err ;
2009-10-06 11:31:09 -07:00
}
2010-05-13 11:19:06 -07:00
mutex_unlock ( & mdsc - > mutex ) ;
2019-04-02 12:34:38 -04:00
return err ;
}
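The function above turns the return value of `wait_for_completion_killable_timeout()` into an error code and then decides which result the caller sees. A small userspace sketch of those two decisions follows. The `demo_*` names are hypothetical, and `-EINTR` in the usage stands in for the kernel-internal value returned when the wait is killed.

```c
#include <assert.h>
#include <errno.h>

/*
 * Map the wait result to an error: positive means the request completed,
 * zero means the timeout expired, negative means the wait was killed.
 */
static int demo_wait_result_to_err(long timeleft)
{
	int err;

	if (timeleft > 0)
		err = 0;
	else if (timeleft == 0)
		err = -ETIMEDOUT;
	else
		err = (int)timeleft;

	assert(err <= 0);
	return err;
}

/*
 * Pick the value handed back to the caller. A real reply wins over a wait
 * error, so a request is only treated as aborted if no reply raced in.
 */
static int demo_final_result(int got_result, int reply_result,
			     int wait_err, int r_err)
{
	if (got_result)
		return reply_result;
	if (wait_err < 0)
		return wait_err;	/* aborted: timed out or killed */
	return r_err;
}
```

For example, a wait that timed out while a reply was being processed still reports the reply's result, matching the "only abort if we didn't race with a real reply" comment above.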
/*
* Synchronously perform an mds request. Take care of all of the
* session setup, forwarding, retry details.
*/
int ceph_mdsc_do_request ( struct ceph_mds_client * mdsc ,
struct inode * dir ,
struct ceph_mds_request * req )
{
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = mdsc - > fsc - > client ;
2019-04-02 12:34:38 -04:00
int err ;
2023-06-12 09:04:07 +08:00
doutc ( cl , " do_request on %p \n " , req ) ;
2019-04-02 12:34:38 -04:00
/* issue */
err = ceph_mdsc_submit_request ( mdsc , dir , req ) ;
if ( ! err )
2022-02-03 09:04:24 -05:00
err = ceph_mdsc_wait_request ( mdsc , req , NULL ) ;
2023-06-12 09:04:07 +08:00
doutc ( cl , " do_request %p done, result %d \n " , req , err ) ;
2009-10-06 11:31:09 -07:00
return err ;
}
2010-05-14 10:02:57 -07:00
/*
2013-03-13 19:44:32 +08:00
* Invalidate dir ' s completeness , dentry lease state on an aborted MDS
2010-05-14 10:02:57 -07:00
* namespace request .
*/
void ceph_invalidate_dir_request ( struct ceph_mds_request * req )
{
2017-11-24 11:51:32 +08:00
struct inode * dir = req - > r_parent ;
struct inode * old_dir = req - > r_old_dentry_dir ;
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = req - > r_mdsc - > fsc - > client ;
2010-05-14 10:02:57 -07:00
2023-06-12 09:04:07 +08:00
doutc ( cl , " invalidate_dir_request %p %p (complete, lease(s)) \n " ,
dir , old_dir ) ;
2010-05-14 10:02:57 -07:00
2017-11-24 11:51:32 +08:00
ceph_dir_clear_complete ( dir ) ;
if ( old_dir )
ceph_dir_clear_complete ( old_dir ) ;
2010-05-14 10:02:57 -07:00
if ( req - > r_dentry )
ceph_invalidate_dentry_lease ( req - > r_dentry ) ;
if ( req - > r_old_dentry )
ceph_invalidate_dentry_lease ( req - > r_old_dentry ) ;
}
2009-10-06 11:31:09 -07:00
/*
* Handle mds reply .
*
* We take the session mutex and parse and process the reply immediately .
* This preserves the logical ordering of replies , capabilities , etc . , sent
* by the MDS as they are applied to our local cache .
*/
static void handle_reply ( struct ceph_mds_session * session , struct ceph_msg * msg )
{
struct ceph_mds_client * mdsc = session - > s_mdsc ;
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = mdsc - > fsc - > client ;
2009-10-06 11:31:09 -07:00
struct ceph_mds_request * req ;
struct ceph_mds_reply_head * head = msg - > front . iov_base ;
struct ceph_mds_reply_info_parsed * rinfo ; /* parsed reply info */
2014-12-23 15:30:54 +08:00
struct ceph_snap_realm * realm ;
2009-10-06 11:31:09 -07:00
u64 tid ;
int err , result ;
2010-02-22 15:12:16 -08:00
int mds = session - > s_mds ;
2023-02-01 09:36:45 +08:00
bool close_sessions = false ;
2009-10-06 11:31:09 -07:00
if ( msg - > front . iov_len < sizeof ( * head ) ) {
2023-06-12 09:04:07 +08:00
pr_err_client ( cl , " got corrupt (short) reply \n " ) ;
2009-12-14 15:13:47 -08:00
ceph_msg_dump ( msg ) ;
2009-10-06 11:31:09 -07:00
return ;
}
/* get request, session */
2009-12-22 11:24:33 -08:00
tid = le64_to_cpu ( msg - > hdr . tid ) ;
2009-10-06 11:31:09 -07:00
mutex_lock ( & mdsc - > mutex ) ;
2016-04-28 16:07:22 +02:00
req = lookup_get_request ( mdsc , tid ) ;
2009-10-06 11:31:09 -07:00
if ( ! req ) {
2023-06-12 09:04:07 +08:00
doutc ( cl , " on unknown tid %llu \n " , tid ) ;
2009-10-06 11:31:09 -07:00
mutex_unlock ( & mdsc - > mutex ) ;
return ;
}
2023-06-12 09:04:07 +08:00
doutc ( cl , " handle_reply %p \n " , req ) ;
2009-10-06 11:31:09 -07:00
/* correct session? */
2010-03-20 20:50:58 -07:00
if ( req - > r_session ! = session ) {
2023-06-12 09:04:07 +08:00
pr_err_client ( cl , " got %llu on session mds%d not mds%d \n " ,
tid , session - > s_mds ,
req - > r_session ? req - > r_session - > s_mds : - 1 ) ;
2009-10-06 11:31:09 -07:00
mutex_unlock ( & mdsc - > mutex ) ;
goto out ;
}
/* dup? */
2017-02-01 13:49:09 -05:00
if ( ( test_bit ( CEPH_MDS_R_GOT_UNSAFE , & req - > r_req_flags ) & & ! head - > safe ) | |
( test_bit ( CEPH_MDS_R_GOT_SAFE , & req - > r_req_flags ) & & head - > safe ) ) {
2023-06-12 09:04:07 +08:00
pr_warn_client ( cl , " got a dup %s reply on %llu from mds%d \n " ,
head - > safe ? " safe " : " unsafe " , tid , mds ) ;
2009-10-06 11:31:09 -07:00
mutex_unlock ( & mdsc - > mutex ) ;
goto out ;
}
2017-02-01 13:49:09 -05:00
if ( test_bit ( CEPH_MDS_R_GOT_SAFE , & req - > r_req_flags ) ) {
2023-06-12 09:04:07 +08:00
pr_warn_client ( cl , " got unsafe after safe on %llu from mds%d \n " ,
tid , mds ) ;
2010-05-13 09:06:02 -07:00
mutex_unlock ( & mdsc - > mutex ) ;
goto out ;
}
2009-10-06 11:31:09 -07:00
result = le32_to_cpu ( head - > result ) ;
if ( head - > safe ) {
2017-02-01 13:49:09 -05:00
set_bit ( CEPH_MDS_R_GOT_SAFE , & req - > r_req_flags ) ;
2009-10-06 11:31:09 -07:00
__unregister_request ( mdsc , req ) ;
2019-12-04 01:27:18 -05:00
/* last request during umount? */
if ( mdsc - > stopping & & ! __get_oldest_req ( mdsc ) )
complete_all ( & mdsc - > safe_umount_waiters ) ;
2017-02-01 13:49:09 -05:00
if ( test_bit ( CEPH_MDS_R_GOT_UNSAFE , & req - > r_req_flags ) ) {
2009-10-06 11:31:09 -07:00
/*
* We already handled the unsafe response , now do the
* cleanup . No need to examine the response ; the MDS
* doesn ' t include any result info in the safe
* response . And even if it did , there is nothing
* useful we could do with a revised return value .
*/
2023-06-12 09:04:07 +08:00
doutc ( cl , " got safe reply %llu, mds%d \n " , tid , mds ) ;
2009-10-06 11:31:09 -07:00
mutex_unlock ( & mdsc - > mutex ) ;
goto out ;
}
2010-05-13 11:19:06 -07:00
} else {
2017-02-01 13:49:09 -05:00
set_bit ( CEPH_MDS_R_GOT_UNSAFE , & req - > r_req_flags ) ;
2009-10-06 11:31:09 -07:00
list_add_tail ( & req - > r_unsafe_item , & req - > r_session - > s_unsafe ) ;
}
2023-06-12 09:04:07 +08:00
doutc ( cl , " tid %lld result %d \n " , tid , result ) ;
2019-01-09 10:10:17 +08:00
if ( test_bit ( CEPHFS_FEATURE_REPLY_ENCODING , & session - > s_features ) )
2022-03-14 10:28:34 +08:00
err = parse_reply_info ( session , msg , req , ( u64 ) - 1 ) ;
2019-01-09 10:10:17 +08:00
else
2022-03-14 10:28:34 +08:00
err = parse_reply_info ( session , msg , req ,
session - > s_con . peer_features ) ;
2009-10-06 11:31:09 -07:00
mutex_unlock ( & mdsc - > mutex ) ;
2020-11-12 10:03:38 -05:00
/* Must find target inode outside of mutexes to avoid deadlocks */
2022-03-14 10:28:34 +08:00
rinfo = & req - > r_reply_info ;
2020-11-12 10:03:38 -05:00
if ( ( err > = 0 ) & & rinfo - > head - > is_target ) {
ceph: preallocate inode for ops that may create one
When creating a new inode, we need to determine the crypto context
before we can transmit the RPC. The fscrypt API has a routine for getting
a crypto context before a create occurs, but it requires an inode.
Change the ceph code to preallocate an inode in advance of a create of
any sort (open(), mknod(), symlink(), etc). Move the existing code that
generates the ACL and SELinux blobs into this routine since that's
mostly common across all the different codepaths.
In most cases, we just want to allow ceph_fill_trace to use that inode
after the reply comes in, so add a new field to the MDS request for it
(r_new_inode).
The async create codepath is a bit different though. In that case, we
want to hash the inode in advance of the RPC so that it can be used
before the reply comes in. If the call subsequently fails with
-EJUKEBOX, then just put the references and clean up the as_ctx. Note
that with this change, we now need to regenerate the as_ctx when this
occurs, but it's quite rare for it to happen.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-08-26 13:11:00 -04:00
struct inode * in = xchg ( & req - > r_new_inode , NULL ) ;
2020-11-12 10:03:38 -05:00
struct ceph_vino tvino = {
. ino = le64_to_cpu ( rinfo - > targeti . in - > ino ) ,
. snap = le64_to_cpu ( rinfo - > targeti . in - > snapid )
} ;
2020-08-26 13:11:00 -04:00
/*
* If we ended up opening an existing inode , discard
* r_new_inode
*/
if ( req - > r_op = = CEPH_MDS_OP_CREATE & &
! req - > r_reply_info . has_create_ino ) {
/* This should never happen on an async create */
WARN_ON_ONCE ( req - > r_deleg_ino ) ;
iput ( in ) ;
in = NULL ;
}
in = ceph_get_inode ( mdsc - > fsc - > sb , tvino , in ) ;
2020-11-12 10:03:38 -05:00
if ( IS_ERR ( in ) ) {
err = PTR_ERR ( in ) ;
mutex_lock ( & session - > s_mutex ) ;
goto out_err ;
}
req - > r_target_inode = in ;
}
2009-10-06 11:31:09 -07:00
mutex_lock ( & session - > s_mutex ) ;
if ( err < 0 ) {
2023-06-12 09:04:07 +08:00
pr_err_client ( cl , " got corrupt reply mds%d(tid:%lld) \n " ,
mds , tid ) ;
2009-12-14 15:13:47 -08:00
ceph_msg_dump ( msg ) ;
2009-10-06 11:31:09 -07:00
goto out_err ;
}
/* snap trace */
2014-12-23 15:30:54 +08:00
realm = NULL ;
2009-10-06 11:31:09 -07:00
if ( rinfo - > snapblob_len ) {
down_write ( & mdsc - > snap_rwsem ) ;
2023-02-01 09:36:45 +08:00
err = ceph_update_snap_trace ( mdsc , rinfo - > snapblob ,
2014-12-23 15:30:54 +08:00
rinfo - > snapblob + rinfo - > snapblob_len ,
le32_to_cpu ( head - > op ) = = CEPH_MDS_OP_RMSNAP ,
& realm ) ;
2023-02-01 09:36:45 +08:00
if ( err ) {
up_write ( & mdsc - > snap_rwsem ) ;
close_sessions = true ;
if ( err = = - EIO )
ceph_msg_dump ( msg ) ;
goto out_err ;
}
2009-10-06 11:31:09 -07:00
downgrade_write ( & mdsc - > snap_rwsem ) ;
} else {
down_read ( & mdsc - > snap_rwsem ) ;
}
/* insert trace into our cache */
2010-05-13 12:01:13 -07:00
mutex_lock ( & req - > r_fill_mutex ) ;
2016-03-07 10:34:50 +08:00
current - > journal_info = req ;
2017-01-31 11:06:13 -05:00
err = ceph_fill_trace ( mdsc - > fsc - > sb , req ) ;
2009-10-06 11:31:09 -07:00
if ( err = = 0 ) {
2012-12-28 09:56:46 -08:00
if ( result = = 0 & & ( req - > r_op = = CEPH_MDS_OP_READDIR | |
2013-09-18 09:44:13 +08:00
req - > r_op = = CEPH_MDS_OP_LSSNAP ) )
2022-03-14 10:28:35 +08:00
err = ceph_readdir_prepopulate ( req , req - > r_session ) ;
2009-10-06 11:31:09 -07:00
}
2016-03-07 10:34:50 +08:00
current - > journal_info = NULL ;
2010-05-13 12:01:13 -07:00
mutex_unlock ( & req - > r_fill_mutex ) ;
2009-10-06 11:31:09 -07:00
up_read ( & mdsc - > snap_rwsem ) ;
2014-12-23 15:30:54 +08:00
if ( realm )
ceph_put_snap_realm ( mdsc , realm ) ;
2015-10-27 18:36:06 +08:00
2019-02-01 14:57:15 +08:00
if ( err = = 0 ) {
if ( req - > r_target_inode & &
test_bit ( CEPH_MDS_R_GOT_UNSAFE , & req - > r_req_flags ) ) {
struct ceph_inode_info * ci =
ceph_inode ( req - > r_target_inode ) ;
spin_lock ( & ci - > i_unsafe_lock ) ;
list_add_tail ( & req - > r_unsafe_target_item ,
& ci - > i_unsafe_iops ) ;
spin_unlock ( & ci - > i_unsafe_lock ) ;
}
ceph_unreserve_caps ( mdsc , & req - > r_caps_reservation ) ;
2015-10-27 18:36:06 +08:00
}
2009-10-06 11:31:09 -07:00
out_err :
2010-05-13 11:19:06 -07:00
mutex_lock ( & mdsc - > mutex ) ;
2017-02-01 13:49:09 -05:00
if ( ! test_bit ( CEPH_MDS_R_ABORTED , & req - > r_req_flags ) ) {
2010-05-13 11:19:06 -07:00
if ( err ) {
req - > r_err = err ;
} else {
2015-08-18 10:30:38 +08:00
req - > r_reply = ceph_msg_get ( msg ) ;
2017-02-01 13:49:09 -05:00
set_bit ( CEPH_MDS_R_GOT_RESULT , & req - > r_req_flags ) ;
2010-05-13 11:19:06 -07:00
}
2009-10-06 11:31:09 -07:00
} else {
2023-06-12 09:04:07 +08:00
doutc ( cl , " reply arrived after request %lld was aborted \n " , tid ) ;
2009-10-06 11:31:09 -07:00
}
2010-05-13 11:19:06 -07:00
mutex_unlock ( & mdsc - > mutex ) ;
2009-10-06 11:31:09 -07:00
mutex_unlock ( & session - > s_mutex ) ;
/* kick calling process */
complete_request ( mdsc , req ) ;
2020-03-19 23:45:02 -04:00
2021-03-22 20:28:49 +08:00
ceph_update_metadata_metrics ( & mdsc - > metric , req - > r_start_latency ,
2020-03-19 23:45:02 -04:00
req - > r_end_latency , err ) ;
2009-10-06 11:31:09 -07:00
out :
ceph_mdsc_put_request ( req ) ;
2023-02-01 09:36:45 +08:00
/* Defer closing the sessions until after the s_mutex lock has been released */
if ( close_sessions )
ceph_mdsc_close_sessions ( mdsc ) ;
2009-10-06 11:31:09 -07:00
return ;
}
/*
* handle mds notification that our request has been forwarded .
*/
2010-02-22 15:12:16 -08:00
static void handle_forward ( struct ceph_mds_client * mdsc ,
struct ceph_mds_session * session ,
struct ceph_msg * msg )
2009-10-06 11:31:09 -07:00
{
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = mdsc - > fsc - > client ;
2009-10-06 11:31:09 -07:00
struct ceph_mds_request * req ;
2010-02-23 14:02:44 -08:00
u64 tid = le64_to_cpu ( msg - > hdr . tid ) ;
2009-10-06 11:31:09 -07:00
u32 next_mds ;
u32 fwd_seq ;
int err = - EINVAL ;
void * p = msg - > front . iov_base ;
void * end = p + msg - > front . iov_len ;
2022-03-29 12:48:01 +08:00
bool aborted = false ;
2009-10-06 11:31:09 -07:00
2010-02-23 14:02:44 -08:00
ceph_decode_need ( & p , end , 2 * sizeof ( u32 ) , bad ) ;
2009-10-14 09:59:09 -07:00
next_mds = ceph_decode_32 ( & p ) ;
fwd_seq = ceph_decode_32 ( & p ) ;
2009-10-06 11:31:09 -07:00
mutex_lock ( & mdsc - > mutex ) ;
2016-04-28 16:07:22 +02:00
req = lookup_get_request ( mdsc , tid ) ;
2009-10-06 11:31:09 -07:00
if ( ! req ) {
2022-03-29 12:48:01 +08:00
mutex_unlock ( & mdsc - > mutex ) ;
2023-06-12 09:04:07 +08:00
doutc ( cl , " forward tid %llu to mds%d - req dne \n " , tid , next_mds ) ;
2022-03-29 12:48:01 +08:00
return ; /* dup reply? */
2009-10-06 11:31:09 -07:00
}
2017-02-01 13:49:09 -05:00
if ( test_bit ( CEPH_MDS_R_ABORTED , & req - > r_req_flags ) ) {
2023-06-12 09:04:07 +08:00
doutc ( cl , " forward tid %llu aborted, unregistering \n " , tid ) ;
2010-05-28 16:43:16 -07:00
__unregister_request ( mdsc , req ) ;
2023-07-25 17:51:59 +08:00
} else if ( fwd_seq < = req - > r_num_fwd | | ( uint32_t ) fwd_seq > = U32_MAX ) {
2022-03-29 12:48:01 +08:00
/*
2023-07-25 17:51:59 +08:00
* Avoid infinite retrying after overflow .
2022-03-29 12:48:01 +08:00
*
2023-07-25 17:51:59 +08:00
* The MDS increments the fwd count on every forward . On the
* client side , if the received num_fwd is not greater than the
* one saved in the request , the MDS is an old version whose
* counter is only 8 bits wide and has overflowed .
2022-03-29 12:48:01 +08:00
*/
2023-07-25 17:51:59 +08:00
mutex_lock ( & req - > r_fill_mutex ) ;
req - > r_err = - EMULTIHOP ;
set_bit ( CEPH_MDS_R_ABORTED , & req - > r_req_flags ) ;
mutex_unlock ( & req - > r_fill_mutex ) ;
aborted = true ;
2023-06-12 09:04:07 +08:00
pr_warn_ratelimited_client ( cl , " forward tid %llu seq overflow \n " ,
tid ) ;
2009-10-06 11:31:09 -07:00
} else {
/* resend. forward race not possible; mds would drop */
2023-06-12 09:04:07 +08:00
doutc ( cl , " forward tid %llu to mds%d (we resend) \n " , tid , next_mds ) ;
2010-05-28 16:43:16 -07:00
BUG_ON ( req - > r_err ) ;
2017-02-01 13:49:09 -05:00
BUG_ON ( test_bit ( CEPH_MDS_R_GOT_RESULT , & req - > r_req_flags ) ) ;
2015-02-04 14:26:22 +08:00
req - > r_attempts = 0 ;
2009-10-06 11:31:09 -07:00
req - > r_num_fwd = fwd_seq ;
req - > r_resend_mds = next_mds ;
put_request_session ( req ) ;
__do_request ( mdsc , req ) ;
}
mutex_unlock ( & mdsc - > mutex ) ;
2022-03-29 12:48:01 +08:00
/* kick calling process */
if ( aborted )
complete_request ( mdsc , req ) ;
ceph_mdsc_put_request ( req ) ;
2009-10-06 11:31:09 -07:00
return ;
bad :
2023-06-12 09:04:07 +08:00
pr_err_client ( cl , " decode error err=%d \n " , err ) ;
2023-05-18 09:40:14 +08:00
ceph_msg_dump ( msg ) ;
2009-10-06 11:31:09 -07:00
}
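The overflow guard in handle_forward() can be tried out in userspace. What follows is a minimal sketch, not the kernel code: classify_forward() and old_mds_next_seq() are hypothetical names, and the sketch assumes an old MDS that keeps its forward counter in 8 bits, so the value wraps from 255 back to 0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical outcome of inspecting a forward notification. */
enum fwd_action { FWD_RESEND, FWD_ABORT };

/*
 * Same comparison as in handle_forward(): a forward is accepted only when
 * the sequence strictly increases and has not saturated at UINT32_MAX.
 */
static enum fwd_action classify_forward(uint32_t saved_num_fwd, uint32_t fwd_seq)
{
	if (fwd_seq <= saved_num_fwd || fwd_seq >= UINT32_MAX)
		return FWD_ABORT;
	return FWD_RESEND;
}

/* Assumed behaviour of an old MDS: the counter is a u8 and wraps around. */
static uint32_t old_mds_next_seq(uint32_t seq)
{
	return (uint8_t)(seq + 1);
}
```

With a saved count of 255, the next value from such an MDS is 0, which is not greater than 255, so the request is aborted instead of being resent forever.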
2019-07-25 20:16:47 +08:00
static int __decode_session_metadata ( void * * p , void * end ,
2020-09-14 13:39:19 +02:00
bool * blocklisted )
2018-12-21 17:41:39 +08:00
{
/* map<string,string> */
u32 n ;
2019-07-25 20:16:47 +08:00
bool err_str ;
2018-12-21 17:41:39 +08:00
ceph_decode_32_safe ( p , end , n , bad ) ;
while ( n - - > 0 ) {
u32 len ;
ceph_decode_32_safe ( p , end , len , bad ) ;
ceph_decode_need ( p , end , len , bad ) ;
2019-07-25 20:16:47 +08:00
err_str = ! strncmp ( * p , " error_string " , len ) ;
2018-12-21 17:41:39 +08:00
* p + = len ;
ceph_decode_32_safe ( p , end , len , bad ) ;
ceph_decode_need ( p , end , len , bad ) ;
2020-09-15 21:11:30 +02:00
/*
* Match " blocklisted (blacklisted) " from newer MDSes ,
* or " blacklisted " from older MDSes .
*/
2019-07-25 20:16:47 +08:00
if ( err_str & & strnstr ( * p , " blacklisted " , len ) )
2020-09-14 13:39:19 +02:00
* blocklisted = true ;
2018-12-21 17:41:39 +08:00
* p + = len ;
}
return 0 ;
bad :
return - 1 ;
}
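The bounded walk over a map<string,string> done by __decode_session_metadata() can be sketched in userspace as below. This is an illustration under stated assumptions, not the kernel implementation: get_u32(), contains() and decode_metadata() are hypothetical helpers, the wire layout is assumed to be a little-endian u32 count followed by length-prefixed key and value strings, and the key is compared with an exact length match, which is a slightly stricter test than the strncmp() above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a little-endian u32 at *p and advance; fail if fewer than 4 bytes remain. */
static int get_u32(const uint8_t **p, const uint8_t *end, uint32_t *v)
{
	if ((size_t)(end - *p) < 4)
		return -1;
	*v = (uint32_t)(*p)[0] | (uint32_t)(*p)[1] << 8 |
	     (uint32_t)(*p)[2] << 16 | (uint32_t)(*p)[3] << 24;
	*p += 4;
	return 0;
}

/* Bounded substring search; buf is not NUL-terminated. */
static bool contains(const uint8_t *buf, size_t len, const char *needle)
{
	size_t n = strlen(needle);

	for (size_t i = 0; i + n <= len; i++)
		if (memcmp(buf + i, needle, n) == 0)
			return true;
	return false;
}

/*
 * Walk the map: u32 count, then per entry u32 key_len, key bytes,
 * u32 val_len, val bytes. Returns 0 on success, -1 on a truncated buffer.
 */
static int decode_metadata(const uint8_t *p, const uint8_t *end, bool *blocklisted)
{
	static const char key[] = "error_string";
	uint32_t n, len;

	assert(blocklisted != NULL);
	if (get_u32(&p, end, &n) < 0)
		return -1;
	while (n-- > 0) {
		bool is_err;

		/* key: check the length against the remaining bytes first */
		if (get_u32(&p, end, &len) < 0 || (size_t)(end - p) < len)
			return -1;
		is_err = len == sizeof(key) - 1 && memcmp(p, key, len) == 0;
		p += len;

		/* value: same bounds check before touching the bytes */
		if (get_u32(&p, end, &len) < 0 || (size_t)(end - p) < len)
			return -1;
		if (is_err && contains(p, len, "blacklisted"))
			*blocklisted = true;
		p += len;
	}
	return 0;
}
```

Every length is validated against the remaining buffer before the pointer advances, which is the role ceph_decode_need() plays in the kernel function.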
2009-10-06 11:31:09 -07:00
/*
* handle a mds session control message
*/
static void handle_session ( struct ceph_mds_session * session ,
struct ceph_msg * msg )
{
struct ceph_mds_client * mdsc = session - > s_mdsc ;
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = mdsc - > fsc - > client ;
2018-12-21 17:41:39 +08:00
int mds = session - > s_mds ;
int msg_version = le16_to_cpu ( msg - > hdr . version ) ;
void * p = msg - > front . iov_base ;
void * end = p + msg - > front . iov_len ;
struct ceph_mds_session_head * h ;
2009-10-06 11:31:09 -07:00
u32 op ;
2020-04-28 08:10:22 -04:00
u64 seq , features = 0 ;
2009-10-06 11:31:09 -07:00
int wake = 0 ;
2020-09-14 13:39:19 +02:00
bool blocklisted = false ;
2009-10-06 11:31:09 -07:00
/* decode */
2018-12-21 17:41:39 +08:00
ceph_decode_need ( & p , end , sizeof ( * h ) , bad ) ;
h = p ;
p + = sizeof ( * h ) ;
2009-10-06 11:31:09 -07:00
op = le32_to_cpu ( h - > op ) ;
seq = le64_to_cpu ( h - > seq ) ;
2018-12-21 17:41:39 +08:00
if ( msg_version > = 3 ) {
u32 len ;
2021-09-27 19:22:27 +05:30
/* version >= 2 and < 5, decode metadata, skip otherwise
* as it's handled via flags .
*/
if ( msg_version > = 5 )
ceph_decode_skip_map ( & p , end , string , string , bad ) ;
else if ( __decode_session_metadata ( & p , end , & blocklisted ) < 0 )
2018-12-21 17:41:39 +08:00
goto bad ;
2021-09-27 19:22:27 +05:30
2018-12-21 17:41:39 +08:00
/* version >= 3, feature bits */
ceph_decode_32_safe ( & p , end , len , bad ) ;
2020-08-04 12:31:56 -04:00
if ( len ) {
ceph_decode_64_safe ( & p , end , features , bad ) ;
p + = len - sizeof ( features ) ;
}
2018-12-21 17:41:39 +08:00
}
2021-09-27 19:22:27 +05:30
if ( msg_version > = 5 ) {
2022-05-23 17:09:51 +01:00
u32 flags , len ;
/* version >= 4 */
ceph_decode_skip_16 ( & p , end , bad ) ; /* struct_v, struct_cv */
ceph_decode_32_safe ( & p , end , len , bad ) ; /* len */
ceph_decode_skip_n ( & p , end , len , bad ) ; /* metric_spec */
2021-09-27 19:22:27 +05:30
/* version >= 5, flags */
2022-05-23 17:09:51 +01:00
ceph_decode_32_safe ( & p , end , flags , bad ) ;
2021-09-27 19:22:27 +05:30
if ( flags & CEPH_SESSION_BLOCKLISTED ) {
2023-06-12 09:04:07 +08:00
pr_warn_client ( cl , " mds%d session blocklisted \n " ,
session - > s_mds ) ;
2021-09-27 19:22:27 +05:30
blocklisted = true ;
}
}
2009-10-06 11:31:09 -07:00
mutex_lock ( & mdsc - > mutex ) ;
2017-03-29 15:30:24 +08:00
if ( op = = CEPH_SESSION_CLOSE ) {
2019-12-19 19:44:09 -05:00
ceph_get_mds_session ( session ) ;
2010-02-22 15:12:16 -08:00
__unregister_session ( mdsc , session ) ;
2017-03-29 15:30:24 +08:00
}
2009-10-06 11:31:09 -07:00
/* FIXME: this ttl calculation is generous */
session - > s_ttl = jiffies + HZ * mdsc - > mdsmap - > m_session_autoclose ;
mutex_unlock ( & mdsc - > mutex ) ;
mutex_lock ( & session - > s_mutex ) ;
2023-06-12 09:04:07 +08:00
doutc ( cl , " mds%d %s %p state %s seq %llu \n " , mds ,
ceph_session_op_name ( op ) , session ,
ceph_session_state_name ( session - > s_state ) , seq ) ;
2009-10-06 11:31:09 -07:00
if ( session - > s_state = = CEPH_MDS_SESSION_HUNG ) {
session - > s_state = CEPH_MDS_SESSION_OPEN ;
2023-06-12 09:04:07 +08:00
pr_info_client ( cl , " mds%d came back \n " , session - > s_mds ) ;
2009-10-06 11:31:09 -07:00
}
switch ( op ) {
case CEPH_SESSION_OPEN :
2010-03-18 14:45:05 -07:00
if ( session - > s_state = = CEPH_MDS_SESSION_RECONNECTING )
2023-06-12 09:04:07 +08:00
pr_info_client ( cl , " mds%d reconnect success \n " ,
session - > s_mds ) ;
2022-05-26 13:21:31 +08:00
if ( session - > s_state = = CEPH_MDS_SESSION_OPEN ) {
2023-06-12 09:04:07 +08:00
pr_notice_client ( cl , " mds%d is already opened \n " ,
session - > s_mds ) ;
2022-05-26 13:21:31 +08:00
} else {
session - > s_state = CEPH_MDS_SESSION_OPEN ;
session - > s_features = features ;
renewed_caps ( mdsc , session , 0 ) ;
if ( test_bit ( CEPHFS_FEATURE_METRIC_COLLECT ,
& session - > s_features ) )
metric_schedule_delayed ( & mdsc - > metric ) ;
}
/*
* The connection may have been broken and the session on the
* client side reinitialized , so the seq needs to be updated
* anyway .
*/
if ( ! session - > s_seq & & seq )
session - > s_seq = seq ;
2009-10-06 11:31:09 -07:00
wake = 1 ;
if ( mdsc - > stopping )
__close_session ( mdsc , session ) ;
break ;
case CEPH_SESSION_RENEWCAPS :
if ( session - > s_renew_seq = = seq )
renewed_caps ( mdsc , session , 1 ) ;
break ;
case CEPH_SESSION_CLOSE :
2010-03-18 14:45:05 -07:00
if ( session - > s_state = = CEPH_MDS_SESSION_RECONNECTING )
2023-06-12 09:04:07 +08:00
pr_info_client ( cl , " mds%d reconnect denied \n " ,
session - > s_mds ) ;
2019-12-05 22:35:51 -05:00
session - > s_state = CEPH_MDS_SESSION_CLOSED ;
2015-03-24 20:15:36 +08:00
cleanup_session_requests ( mdsc , session ) ;
2009-10-06 11:31:09 -07:00
remove_session_caps ( session ) ;
2014-09-11 14:25:18 +08:00
wake = 2 ; /* for good measure */
2010-08-11 14:51:23 -07:00
wake_up_all ( & mdsc - > session_close_wq ) ;
2009-10-06 11:31:09 -07:00
break ;
case CEPH_SESSION_STALE :
2023-06-12 09:04:07 +08:00
pr_info_client ( cl , " mds%d caps went stale, renewing \n " ,
session - > s_mds ) ;
2021-06-04 12:03:09 -04:00
atomic_inc ( & session - > s_cap_gen ) ;
2012-01-12 17:48:11 -08:00
session - > s_cap_ttl = jiffies - 1 ;
2009-10-06 11:31:09 -07:00
send_renew_caps ( mdsc , session ) ;
break ;
case CEPH_SESSION_RECALL_STATE :
2018-01-24 21:24:33 +08:00
ceph_trim_caps ( mdsc , session , le32_to_cpu ( h - > max_caps ) ) ;
2009-10-06 11:31:09 -07:00
break ;
2013-11-22 14:48:37 +08:00
case CEPH_SESSION_FLUSHMSG :
2023-02-07 13:04:52 +08:00
/* flush cap releases */
spin_lock ( & session - > s_cap_lock ) ;
if ( session - > s_num_cap_releases )
ceph_flush_cap_releases ( mdsc , session ) ;
spin_unlock ( & session - > s_cap_lock ) ;
2013-11-22 14:48:37 +08:00
send_flushmsg_ack ( mdsc , session , seq ) ;
break ;
2015-01-05 11:04:04 +08:00
case CEPH_SESSION_FORCE_RO :
2023-06-12 09:04:07 +08:00
doutc ( cl , " force_session_readonly %p \n " , session ) ;
2015-01-05 11:04:04 +08:00
spin_lock ( & session - > s_cap_lock ) ;
session - > s_readonly = true ;
spin_unlock ( & session - > s_cap_lock ) ;
2018-12-10 16:35:09 +08:00
wake_up_session_caps ( session , FORCE_RO ) ;
2015-01-05 11:04:04 +08:00
break ;
2016-09-14 16:39:51 +08:00
case CEPH_SESSION_REJECT :
WARN_ON ( session - > s_state ! = CEPH_MDS_SESSION_OPENING ) ;
2023-06-12 09:04:07 +08:00
pr_info_client ( cl , " mds%d rejected session \n " ,
session - > s_mds ) ;
2016-09-14 16:39:51 +08:00
session - > s_state = CEPH_MDS_SESSION_REJECTED ;
cleanup_session_requests ( mdsc , session ) ;
remove_session_caps ( session ) ;
2020-09-14 13:39:19 +02:00
if ( blocklisted )
mdsc - > fsc - > blocklisted = true ;
2016-09-14 16:39:51 +08:00
wake = 2 ; /* for good measure */
break ;
2009-10-06 11:31:09 -07:00
default :
2023-06-12 09:04:07 +08:00
pr_err_client ( cl , " bad op %d mds%d \n " , op , mds ) ;
2009-10-06 11:31:09 -07:00
WARN_ON ( 1 ) ;
}
mutex_unlock ( & session - > s_mutex ) ;
if ( wake ) {
mutex_lock ( & mdsc - > mutex ) ;
__wake_requests ( mdsc , & session - > s_waiting ) ;
2014-09-11 14:25:18 +08:00
if ( wake = = 2 )
kick_requests ( mdsc , mds ) ;
2009-10-06 11:31:09 -07:00
mutex_unlock ( & mdsc - > mutex ) ;
}
2017-03-29 15:30:24 +08:00
if ( op = = CEPH_SESSION_CLOSE )
ceph_put_mds_session ( session ) ;
2009-10-06 11:31:09 -07:00
return ;
bad :
2023-06-12 09:04:07 +08:00
pr_err_client ( cl , " corrupt message mds%d len %d \n " , mds ,
( int ) msg - > front . iov_len ) ;
2009-12-14 15:13:47 -08:00
ceph_msg_dump ( msg ) ;
2009-10-06 11:31:09 -07:00
return ;
}
2020-02-18 14:12:45 -05:00
void ceph_mdsc_release_dir_caps ( struct ceph_mds_request * req )
{
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = req - > r_mdsc - > fsc - > client ;
2020-02-18 14:12:45 -05:00
int dcaps ;
dcaps = xchg ( & req - > r_dir_caps , 0 ) ;
if ( dcaps ) {
2023-06-12 09:04:07 +08:00
doutc ( cl , " releasing r_dir_caps=%s \n " , ceph_cap_string ( dcaps ) ) ;
2020-02-18 14:12:45 -05:00
ceph_put_cap_refs ( ceph_inode ( req - > r_parent ) , dcaps ) ;
}
}
2020-05-27 09:09:27 -04:00
void ceph_mdsc_release_dir_caps_no_check ( struct ceph_mds_request * req )
{
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = req - > r_mdsc - > fsc - > client ;
2020-05-27 09:09:27 -04:00
int dcaps ;
dcaps = xchg ( & req - > r_dir_caps , 0 ) ;
if ( dcaps ) {
2023-06-12 09:04:07 +08:00
doutc ( cl , " releasing r_dir_caps=%s \n " , ceph_cap_string ( dcaps ) ) ;
2020-05-27 09:09:27 -04:00
ceph_put_cap_refs_no_check_caps ( ceph_inode ( req - > r_parent ) ,
dcaps ) ;
}
}
2009-10-06 11:31:09 -07:00
/*
* called under session - > s_mutex .
*/
static void replay_unsafe_requests ( struct ceph_mds_client * mdsc ,
struct ceph_mds_session * session )
{
struct ceph_mds_request * req , * nreq ;
2015-02-04 14:26:22 +08:00
struct rb_node * p ;
2009-10-06 11:31:09 -07:00
2023-06-12 09:04:07 +08:00
doutc ( mdsc - > fsc - > client , " mds%d \n " , session - > s_mds ) ;
2009-10-06 11:31:09 -07:00
mutex_lock ( & mdsc - > mutex ) ;
2019-12-05 20:50:21 -05:00
list_for_each_entry_safe ( req , nreq , & session - > s_unsafe , r_unsafe_item )
2020-12-09 08:24:18 -05:00
__send_request ( session , req , true ) ;
2015-02-04 14:26:22 +08:00
/*
* Also re - send old requests when the MDS enters the reconnect stage , so
* that the MDS can process completed requests in the clientreplay stage .
*/
p = rb_first ( & mdsc - > request_tree ) ;
while ( p ) {
req = rb_entry ( p , struct ceph_mds_request , r_node ) ;
p = rb_next ( p ) ;
2017-02-01 13:49:09 -05:00
if ( test_bit ( CEPH_MDS_R_GOT_UNSAFE , & req - > r_req_flags ) )
2015-02-04 14:26:22 +08:00
continue ;
if ( req - > r_attempts = = 0 )
continue ; /* only old requests */
2020-02-18 14:12:45 -05:00
if ( ! req - > r_session )
continue ;
if ( req - > r_session - > s_mds ! = session - > s_mds )
continue ;
2020-05-27 09:09:27 -04:00
ceph_mdsc_release_dir_caps_no_check ( req ) ;
2020-02-18 14:12:45 -05:00
2020-12-09 08:24:18 -05:00
__send_request ( session , req , true ) ;
2015-02-04 14:26:22 +08:00
}
2009-10-06 11:31:09 -07:00
mutex_unlock ( & mdsc - > mutex ) ;
}
2019-01-01 16:28:33 +08:00
static int send_reconnect_partial ( struct ceph_reconnect_state * recon_state )
{
struct ceph_msg * reply ;
struct ceph_pagelist * _pagelist ;
struct page * page ;
__le32 * addr ;
int err = - ENOMEM ;
if ( ! recon_state - > allow_multi )
return - ENOSPC ;
/* can't handle a message that contains both caps and realms */
BUG_ON ( ! recon_state - > nr_caps = = ! recon_state - > nr_realms ) ;
/* pre-allocate new pagelist */
_pagelist = ceph_pagelist_alloc ( GFP_NOFS ) ;
if ( ! _pagelist )
return - ENOMEM ;
reply = ceph_msg_new2 ( CEPH_MSG_CLIENT_RECONNECT , 0 , 1 , GFP_NOFS , false ) ;
if ( ! reply )
goto fail_msg ;
/* placeholder for nr_caps */
err = ceph_pagelist_encode_32 ( _pagelist , 0 ) ;
if ( err < 0 )
goto fail ;
if ( recon_state - > nr_caps ) {
/* currently encoding caps */
err = ceph_pagelist_encode_32 ( recon_state - > pagelist , 0 ) ;
if ( err )
goto fail ;
} else {
/* placeholder for nr_realms (currently encoding realms) */
err = ceph_pagelist_encode_32 ( _pagelist , 0 ) ;
if ( err < 0 )
goto fail ;
}
err = ceph_pagelist_encode_8 ( recon_state - > pagelist , 1 ) ;
if ( err )
goto fail ;
page = list_first_entry ( & recon_state - > pagelist - > head , struct page , lru ) ;
addr = kmap_atomic ( page ) ;
if ( recon_state - > nr_caps ) {
/* currently encoding caps */
* addr = cpu_to_le32 ( recon_state - > nr_caps ) ;
} else {
/* currently encoding realms */
* ( addr + 1 ) = cpu_to_le32 ( recon_state - > nr_realms ) ;
}
kunmap_atomic ( addr ) ;
reply - > hdr . version = cpu_to_le16 ( 5 ) ;
reply - > hdr . compat_version = cpu_to_le16 ( 4 ) ;
reply - > hdr . data_len = cpu_to_le32 ( recon_state - > pagelist - > length ) ;
ceph_msg_data_add_pagelist ( reply , recon_state - > pagelist ) ;
ceph_con_send ( & recon_state - > session - > s_con , reply ) ;
ceph_pagelist_release ( recon_state - > pagelist ) ;
recon_state - > pagelist = _pagelist ;
recon_state - > nr_caps = 0 ;
recon_state - > nr_realms = 0 ;
recon_state - > msg_version = 5 ;
return 0 ;
fail :
ceph_msg_put ( reply ) ;
fail_msg :
ceph_pagelist_release ( _pagelist ) ;
return err ;
}
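send_reconnect_partial() relies on a common encoding pattern: write a zero placeholder for a count at the head of the buffer, append the entries, and patch the real count in once it is known. A minimal userspace sketch of that pattern follows. The names struct outbuf, append_le32() and backfill_count() are hypothetical, and a fixed array stands in for the pagelist; the kernel code instead maps the first page of the pagelist and stores the count there.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-size buffer standing in for a pagelist. */
struct outbuf {
	uint8_t data[256];
	size_t len;
};

/* Store v at dst in little-endian byte order. */
static void put_le32(uint8_t *dst, uint32_t v)
{
	dst[0] = (uint8_t)v;
	dst[1] = (uint8_t)(v >> 8);
	dst[2] = (uint8_t)(v >> 16);
	dst[3] = (uint8_t)(v >> 24);
}

/* Append a little-endian u32; returns -1 when the buffer is full. */
static int append_le32(struct outbuf *b, uint32_t v)
{
	assert(b != NULL);
	if (sizeof(b->data) - b->len < 4)
		return -1;
	put_le32(b->data + b->len, v);
	b->len += 4;
	return 0;
}

/* Overwrite the placeholder at offset 0 once the final count is known. */
static void backfill_count(struct outbuf *b, uint32_t count)
{
	assert(b != NULL && b->len >= 4);
	put_le32(b->data, count);
}
```

A caller would append a 0 first, then the entries, and finally call backfill_count() with the number of entries actually encoded.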
2020-08-11 15:23:03 +08:00
static struct dentry * d_find_primary ( struct inode * inode )
{
struct dentry * alias , * dn = NULL ;
if ( hlist_empty ( & inode - > i_dentry ) )
return NULL ;
spin_lock ( & inode - > i_lock ) ;
if ( hlist_empty ( & inode - > i_dentry ) )
goto out_unlock ;
if ( S_ISDIR ( inode - > i_mode ) ) {
alias = hlist_entry ( inode - > i_dentry . first , struct dentry , d_u . d_alias ) ;
if ( ! IS_ROOT ( alias ) )
dn = dget ( alias ) ;
goto out_unlock ;
}
hlist_for_each_entry ( alias , & inode - > i_dentry , d_u . d_alias ) {
spin_lock ( & alias - > d_lock ) ;
if ( ! d_unhashed ( alias ) & &
( ceph_dentry ( alias ) - > flags & CEPH_DENTRY_PRIMARY_LINK ) ) {
dn = dget_dlock ( alias ) ;
}
spin_unlock ( & alias - > d_lock ) ;
if ( dn )
break ;
}
out_unlock :
spin_unlock ( & inode - > i_lock ) ;
return dn ;
}
2009-10-06 11:31:09 -07:00
/*
* Encode information about a cap for a reconnect with the MDS .
*/
2023-04-19 10:39:14 +08:00
static int reconnect_caps_cb ( struct inode * inode , int mds , void * arg )
2009-10-06 11:31:09 -07:00
{
2023-06-09 15:15:47 +08:00
struct ceph_mds_client * mdsc = ceph_sb_to_mdsc ( inode - > i_sb ) ;
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = ceph_inode_to_client ( inode ) ;
2010-05-12 15:21:32 -07:00
union {
struct ceph_mds_cap_reconnect v2 ;
struct ceph_mds_cap_reconnect_v1 v1 ;
} rec ;
2023-04-19 10:39:14 +08:00
struct ceph_inode_info * ci = ceph_inode ( inode ) ;
2010-05-12 15:21:32 -07:00
struct ceph_reconnect_state * recon_state = arg ;
struct ceph_pagelist * pagelist = recon_state - > pagelist ;
2020-08-11 15:23:03 +08:00
struct dentry * dentry ;
2023-04-19 10:39:14 +08:00
struct ceph_cap * cap ;
2020-08-11 15:23:03 +08:00
char * path ;
2023-05-08 14:45:01 +08:00
int pathlen = 0 , err ;
2020-08-11 15:23:03 +08:00
u64 pathbase ;
2016-07-05 09:32:31 +08:00
u64 snap_follows ;
2009-10-06 11:31:09 -07:00
2020-08-11 15:23:03 +08:00
dentry = d_find_primary ( inode ) ;
if ( dentry ) {
/* set pathbase to parent dir when msg_version >= 2 */
2023-06-09 15:15:47 +08:00
path = ceph_mdsc_build_path ( mdsc , dentry , & pathlen , & pathbase ,
2020-08-11 15:23:03 +08:00
recon_state - > msg_version > = 2 ) ;
dput ( dentry ) ;
if ( IS_ERR ( path ) ) {
err = PTR_ERR ( path ) ;
goto out_err ;
}
} else {
path = NULL ;
pathbase = 0 ;
}
2011-11-30 09:47:09 -08:00
spin_lock ( & ci - > i_ceph_lock ) ;
2023-04-19 10:39:14 +08:00
cap = __get_cap_for_mds ( ci , mds ) ;
if ( ! cap ) {
spin_unlock ( & ci - > i_ceph_lock ) ;
2023-05-08 14:45:01 +08:00
err = 0 ;
2023-04-19 10:39:14 +08:00
goto out_err ;
}
2023-06-12 09:04:07 +08:00
doutc ( cl , " adding %p ino %llx.%llx cap %p %lld %s \n " , inode ,
ceph_vinop ( inode ) , cap , cap - > cap_id ,
ceph_cap_string ( cap - > issued ) ) ;
2023-04-19 10:39:14 +08:00
2009-10-06 11:31:09 -07:00
cap - > seq = 0 ; /* reset cap seq */
cap - > issue_seq = 0 ; /* and issue_seq */
2013-05-31 16:25:36 +08:00
cap - > mseq = 0 ; /* and migrate_seq */
2021-06-04 12:03:09 -04:00
cap - > cap_gen = atomic_read ( & cap - > session - > s_cap_gen ) ;
2010-05-12 15:21:32 -07:00
2020-02-18 14:12:45 -05:00
/* These are lost when the session goes away */
2020-01-02 07:11:38 -05:00
if ( S_ISDIR ( inode - > i_mode ) ) {
if ( cap - > issued & CEPH_CAP_DIR_CREATE ) {
ceph_put_string ( rcu_dereference_raw ( ci - > i_cached_layout . pool_ns ) ) ;
memset ( & ci - > i_cached_layout , 0 , sizeof ( ci - > i_cached_layout ) ) ;
}
2020-02-18 14:12:45 -05:00
cap - > issued & = ~ CEPH_CAP_ANY_DIR_OPS ;
2020-01-02 07:11:38 -05:00
}
2020-02-18 14:12:45 -05:00
2016-07-04 22:05:18 +08:00
if ( recon_state - > msg_version > = 2 ) {
2010-05-12 15:21:32 -07:00
rec . v2 . cap_id = cpu_to_le64 ( cap - > cap_id ) ;
rec . v2 . wanted = cpu_to_le32 ( __ceph_caps_wanted ( ci ) ) ;
rec . v2 . issued = cpu_to_le32 ( cap - > issued ) ;
rec . v2 . snaprealm = cpu_to_le64 ( ci - > i_snap_realm - > ino ) ;
2020-08-11 15:23:03 +08:00
rec . v2 . pathbase = cpu_to_le64 ( pathbase ) ;
2017-10-31 15:51:14 -04:00
rec . v2 . flock_len = ( __force __le32 )
( ( ci - > i_ceph_flags & CEPH_I_ERROR_FILELOCK ) ? 0 : 1 ) ;
2010-05-12 15:21:32 -07:00
} else {
2023-10-04 14:52:09 -04:00
struct timespec64 ts ;
2010-05-12 15:21:32 -07:00
rec . v1 . cap_id = cpu_to_le64 ( cap - > cap_id ) ;
rec . v1 . wanted = cpu_to_le32 ( __ceph_caps_wanted ( ci ) ) ;
rec . v1 . issued = cpu_to_le32 ( cap - > issued ) ;
2021-04-09 15:58:35 -04:00
rec . v1 . size = cpu_to_le64 ( i_size_read ( inode ) ) ;
2023-10-04 14:52:09 -04:00
ts = inode_get_mtime ( inode ) ;
ceph_encode_timespec64 ( & rec . v1 . mtime , & ts ) ;
ts = inode_get_atime ( inode ) ;
ceph_encode_timespec64 ( & rec . v1 . atime , & ts ) ;
2010-05-12 15:21:32 -07:00
rec . v1 . snaprealm = cpu_to_le64 ( ci - > i_snap_realm - > ino ) ;
2020-08-11 15:23:03 +08:00
rec . v1 . pathbase = cpu_to_le64 ( pathbase ) ;
2010-05-12 15:21:32 -07:00
}
2016-07-05 09:32:31 +08:00
if ( list_empty ( & ci - > i_cap_snaps ) ) {
2017-08-16 21:42:39 +08:00
snap_follows = ci - > i_head_snapc ? ci - > i_head_snapc - > seq : 0 ;
2016-07-05 09:32:31 +08:00
} else {
struct ceph_cap_snap * capsnap =
list_first_entry ( & ci - > i_cap_snaps ,
struct ceph_cap_snap , ci_item ) ;
snap_follows = capsnap - > follows ;
2010-05-12 15:21:32 -07:00
}
2011-11-30 09:47:09 -08:00
spin_unlock ( & ci - > i_ceph_lock ) ;
2009-10-06 11:31:09 -07:00
2016-07-04 22:05:18 +08:00
if ( recon_state - > msg_version > = 2 ) {
2010-08-02 15:34:23 -07:00
int num_fcntl_locks , num_flock_locks ;
2017-09-11 10:36:28 +08:00
struct ceph_filelock * flocks = NULL ;
2019-01-01 16:28:33 +08:00
size_t struct_len , total_len = sizeof ( u64 ) ;
2016-07-04 22:05:18 +08:00
u8 struct_v = 0 ;
2013-05-15 13:03:35 -05:00
encode_again :
2017-09-11 10:58:55 +08:00
if ( rec . v2 . flock_len ) {
ceph_count_locks ( inode , & num_fcntl_locks , & num_flock_locks ) ;
} else {
num_fcntl_locks = 0 ;
num_flock_locks = 0 ;
}
2017-09-11 10:36:28 +08:00
if ( num_fcntl_locks + num_flock_locks > 0 ) {
treewide: kmalloc() -> kmalloc_array()
The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
patch replaces cases of:
kmalloc(a * b, gfp)
with:
kmalloc_array(a, b, gfp)
as well as handling cases of:
kmalloc(a * b * c, gfp)
with:
kmalloc(array3_size(a, b, c), gfp)
as it's slightly less ugly than:
kmalloc_array(array_size(a, b), c, gfp)
This does, however, attempt to ignore constant size factors like:
kmalloc(4 * 1024, gfp)
though any constants defined via macros get caught up in the conversion.
Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.
The tools/ directory was manually excluded, since it has its own
implementation of kmalloc().
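The point of the two-factor form is that the allocator sees both factors and can refuse the request when their product would overflow, instead of silently allocating a buffer that is too small. A minimal userspace sketch of that idea, assuming a hypothetical helper alloc_array() built on malloc() (this is not the kernel's implementation):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical userspace analogue of a two-factor allocator: returns NULL
 * instead of allocating when n * size would overflow size_t.
 */
static void *alloc_array(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;
	return malloc(n * size);
}
```

An open-coded malloc(n * size) with the same arguments would wrap around and return a short buffer, which is the class of bug the conversion is meant to rule out.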
The Coccinelle script used for this was:
// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@
(
kmalloc(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kmalloc(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)
// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@
(
kmalloc(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kmalloc(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kmalloc(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kmalloc(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)
// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@
(
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * (COUNT_ID)
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * COUNT_ID
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * (COUNT_CONST)
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * COUNT_CONST
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * (COUNT_ID)
+ COUNT_ID, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * COUNT_ID
+ COUNT_ID, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * (COUNT_CONST)
+ COUNT_CONST, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * COUNT_CONST
+ COUNT_CONST, sizeof(THING)
, ...)
)
// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@
- kmalloc
+ kmalloc_array
(
- SIZE * COUNT
+ COUNT, SIZE
, ...)
// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@
(
kmalloc(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kmalloc(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kmalloc(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kmalloc(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)
// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@
(
kmalloc(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kmalloc(
- sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kmalloc(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kmalloc(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kmalloc(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kmalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)
// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@
(
kmalloc(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)
// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@
(
kmalloc(C1 * C2 * C3, ...)
|
kmalloc(
- (E1) * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
|
kmalloc(
- (E1) * (E2) * E3
+ array3_size(E1, E2, E3)
, ...)
|
kmalloc(
- (E1) * (E2) * (E3)
+ array3_size(E1, E2, E3)
, ...)
|
kmalloc(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)
// And then all remaining 2-factor products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@
(
kmalloc(sizeof(THING) * C2, ...)
|
kmalloc(sizeof(TYPE) * C2, ...)
|
kmalloc(C1 * C2 * C3, ...)
|
kmalloc(C1 * C2, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * (E2)
+ E2, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * E2
+ E2, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * (E2)
+ E2, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * E2
+ E2, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- (E1) * E2
+ E1, E2
, ...)
|
- kmalloc
+ kmalloc_array
(
- (E1) * (E2)
+ E1, E2
, ...)
|
- kmalloc
+ kmalloc_array
(
- E1 * E2
+ E1, E2
, ...)
)
Signed-off-by: Kees Cook <keescook@chromium.org>
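The point of the conversion above is that an open-coded `count * size` can silently wrap, while the array helpers reject a product that does not fit. The following is a minimal userspace sketch of that idea, assuming a GCC- or Clang-compatible compiler for `__builtin_mul_overflow`. `checked_alloc_array` and `checked_size3` are hypothetical stand-ins for illustration, not the kernel's implementation of `kmalloc_array()` or `array3_size()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical userspace stand-in for kmalloc_array(): return NULL
 * instead of allocating when n * size does not fit in a size_t.
 */
static void *checked_alloc_array(size_t n, size_t size)
{
	size_t bytes;

	if (__builtin_mul_overflow(n, size, &bytes))
		return NULL;	/* product wrapped: refuse the request */
	return malloc(bytes);
}

/*
 * Hypothetical stand-in for array3_size(): saturate to SIZE_MAX on
 * overflow so that a later allocation attempt cannot succeed with a
 * truncated size.
 */
static size_t checked_size3(size_t a, size_t b, size_t c)
{
	size_t bytes;

	if (__builtin_mul_overflow(a, b, &bytes) ||
	    __builtin_mul_overflow(bytes, c, &bytes))
		return SIZE_MAX;
	return bytes;
}
```

With these helpers, `checked_alloc_array(SIZE_MAX, 2)` returns NULL rather than a buffer far smaller than the caller expects.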
2018-06-12 13:55:00 -07:00
flocks = kmalloc_array ( num_fcntl_locks + num_flock_locks ,
sizeof ( struct ceph_filelock ) ,
GFP_NOFS ) ;
2017-09-11 10:36:28 +08:00
if ( ! flocks ) {
err = - ENOMEM ;
2018-12-13 16:34:11 +08:00
goto out_err ;
2017-09-11 10:36:28 +08:00
}
err = ceph_encode_locks_to_buffer ( inode , flocks ,
num_fcntl_locks ,
num_flock_locks ) ;
if ( err ) {
kfree ( flocks ) ;
flocks = NULL ;
if ( err = = - ENOSPC )
goto encode_again ;
2018-12-13 16:34:11 +08:00
goto out_err ;
2017-09-11 10:36:28 +08:00
}
} else {
2013-05-15 13:03:35 -05:00
kfree ( flocks ) ;
2017-09-11 10:36:28 +08:00
flocks = NULL ;
2013-05-15 13:03:35 -05:00
}
2016-07-04 22:05:18 +08:00
if ( recon_state - > msg_version > = 3 ) {
/* version, compat_version and struct_len */
2019-01-01 16:28:33 +08:00
total_len + = 2 * sizeof ( u8 ) + sizeof ( u32 ) ;
2016-07-05 09:32:31 +08:00
struct_v = 2 ;
2016-07-04 22:05:18 +08:00
}
2013-05-15 13:03:35 -05:00
/*
* number of encoded locks is stable, so copy to pagelist
*/
2016-07-04 22:05:18 +08:00
struct_len = 2 * sizeof ( u32 ) +
( num_fcntl_locks + num_flock_locks ) *
sizeof ( struct ceph_filelock ) ;
rec . v2 . flock_len = cpu_to_le32 ( struct_len ) ;
2020-08-11 15:23:03 +08:00
struct_len + = sizeof ( u32 ) + pathlen + sizeof ( rec . v2 ) ;
2016-07-04 22:05:18 +08:00
2016-07-05 09:32:31 +08:00
if ( struct_v > = 2 )
struct_len + = sizeof ( u64 ) ; /* snap_follows */
2016-07-04 22:05:18 +08:00
total_len + = struct_len ;
2019-01-01 16:28:33 +08:00
if ( pagelist - > length + total_len > RECONNECT_MAX_SIZE ) {
err = send_reconnect_partial ( recon_state ) ;
if ( err )
goto out_freeflocks ;
pagelist = recon_state - > pagelist ;
2018-12-13 16:34:11 +08:00
}
2016-07-04 22:05:18 +08:00
2019-01-01 16:28:33 +08:00
err = ceph_pagelist_reserve ( pagelist , total_len ) ;
if ( err )
goto out_freeflocks ;
ceph_pagelist_encode_64 ( pagelist , ceph_ino ( inode ) ) ;
2018-12-13 16:34:11 +08:00
if ( recon_state - > msg_version > = 3 ) {
ceph_pagelist_encode_8 ( pagelist , struct_v ) ;
ceph_pagelist_encode_8 ( pagelist , 1 ) ;
ceph_pagelist_encode_32 ( pagelist , struct_len ) ;
2016-07-04 22:05:18 +08:00
}
2020-08-11 15:23:03 +08:00
ceph_pagelist_encode_string ( pagelist , path , pathlen ) ;
2018-12-13 16:34:11 +08:00
ceph_pagelist_append ( pagelist , & rec , sizeof ( rec . v2 ) ) ;
ceph_locks_to_pagelist ( flocks , pagelist ,
num_fcntl_locks , num_flock_locks ) ;
if ( struct_v > = 2 )
ceph_pagelist_encode_64 ( pagelist , snap_follows ) ;
2019-01-01 16:28:33 +08:00
out_freeflocks :
2013-05-15 13:03:35 -05:00
kfree ( flocks ) ;
2010-09-07 15:59:27 -07:00
} else {
2018-12-13 16:34:11 +08:00
err = ceph_pagelist_reserve ( pagelist ,
2019-01-01 16:28:33 +08:00
sizeof ( u64 ) + sizeof ( u32 ) +
pathlen + sizeof ( rec . v1 ) ) ;
2020-08-11 15:23:03 +08:00
if ( err )
goto out_err ;
2018-12-13 16:34:11 +08:00
2019-01-01 16:28:33 +08:00
ceph_pagelist_encode_64 ( pagelist , ceph_ino ( inode ) ) ;
2018-12-13 16:34:11 +08:00
ceph_pagelist_encode_string ( pagelist , path , pathlen ) ;
ceph_pagelist_append ( pagelist , & rec , sizeof ( rec . v1 ) ) ;
2010-08-02 15:34:23 -07:00
}
2013-09-22 10:28:10 +08:00
2018-12-13 16:34:11 +08:00
out_err :
2020-08-11 15:23:03 +08:00
ceph_mdsc_free_path ( path , pathlen ) ;
if ( ! err )
2019-01-01 16:28:33 +08:00
recon_state - > nr_caps + + ;
return err ;
}
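The length bookkeeping in the lock branch above can be sketched in isolation as follows. This is only an illustration of the arithmetic: the record sizes are passed in through the hypothetical `struct rec_sizes` because the real values of `sizeof(struct ceph_filelock)` and `sizeof(rec.v2)` come from the ceph headers, and the helper names are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sizes the caller must supply; the real values come from the ceph headers. */
struct rec_sizes {
	size_t filelock_size;	/* stands in for sizeof(struct ceph_filelock) */
	size_t rec_v2_size;	/* stands in for sizeof(rec.v2) */
};

/* Length of the lock section: two u32 counters plus the lock records. */
static size_t flock_section_len(const struct rec_sizes *sz, size_t nr_locks)
{
	return 2 * sizeof(uint32_t) + nr_locks * sz->filelock_size;
}

/*
 * Length of the whole versioned structure: lock section, the u32 path
 * length prefix, the path bytes, the fixed record and, from struct
 * version 2 on, the u64 snap_follows field.
 */
static size_t cap_struct_len(const struct rec_sizes *sz, size_t nr_locks,
			     size_t pathlen, int struct_v)
{
	size_t len = flock_section_len(sz, nr_locks);

	len += sizeof(uint32_t) + pathlen + sz->rec_v2_size;
	if (struct_v >= 2)
		len += sizeof(uint64_t);	/* snap_follows */
	return len;
}
```

Computing the length once and reserving that many bytes up front is what lets the later encode calls in the function run without individual error checks.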
static int encode_snap_realms ( struct ceph_mds_client * mdsc ,
struct ceph_reconnect_state * recon_state )
{
struct rb_node * p ;
struct ceph_pagelist * pagelist = recon_state - > pagelist ;
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = mdsc - > fsc - > client ;
2019-01-01 16:28:33 +08:00
int err = 0 ;
if ( recon_state - > msg_version > = 4 ) {
err = ceph_pagelist_encode_32 ( pagelist , mdsc - > num_snap_realms ) ;
if ( err < 0 )
goto fail ;
}
/*
* snaprealms. we provide mds with the ino, seq (version), and
* parent for all of our realms. If the mds has any newer info,
* it will tell us.
*/
for ( p = rb_first ( & mdsc - > snap_realms ) ; p ; p = rb_next ( p ) ) {
struct ceph_snap_realm * realm =
rb_entry ( p , struct ceph_snap_realm , node ) ;
struct ceph_mds_snaprealm_reconnect sr_rec ;
if ( recon_state - > msg_version > = 4 ) {
size_t need = sizeof ( u8 ) * 2 + sizeof ( u32 ) +
sizeof ( sr_rec ) ;
if ( pagelist - > length + need > RECONNECT_MAX_SIZE ) {
err = send_reconnect_partial ( recon_state ) ;
if ( err )
goto fail ;
pagelist = recon_state - > pagelist ;
}
err = ceph_pagelist_reserve ( pagelist , need ) ;
if ( err )
goto fail ;
ceph_pagelist_encode_8 ( pagelist , 1 ) ;
ceph_pagelist_encode_8 ( pagelist , 1 ) ;
ceph_pagelist_encode_32 ( pagelist , sizeof ( sr_rec ) ) ;
}
2023-06-12 09:04:07 +08:00
doutc ( cl , " adding snap realm %llx seq %lld parent %llx \n " ,
realm - > ino , realm - > seq , realm - > parent_ino ) ;
2019-01-01 16:28:33 +08:00
sr_rec . ino = cpu_to_le64 ( realm - > ino ) ;
sr_rec . seq = cpu_to_le64 ( realm - > seq ) ;
sr_rec . parent = cpu_to_le64 ( realm - > parent_ino ) ;
err = ceph_pagelist_append ( pagelist , & sr_rec , sizeof ( sr_rec ) ) ;
if ( err )
goto fail ;
recon_state - > nr_realms + + ;
}
fail :
2009-12-23 12:21:51 -08:00
return err ;
2009-10-06 11:31:09 -07:00
}
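The loop above follows a flush-when-full pattern: before queuing a record, check whether it would push the current message past the size cap and, if so, send what has been collected and continue in a fresh message. A minimal userspace sketch of that pattern follows. `BATCH_MAX`, `struct batch`, `flush_batch` and `queue_record` are hypothetical names, and the flush here simply resets a counter instead of sending anything.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_MAX 64	/* hypothetical cap, playing the role of RECONNECT_MAX_SIZE */

struct batch {
	size_t length;	/* bytes queued in the current message */
	int nr_sent;	/* how many partial messages went out */
};

/* Stand-in for send_reconnect_partial(): ship the batch and start a new one. */
static int flush_batch(struct batch *b)
{
	b->nr_sent++;
	b->length = 0;
	return 0;
}

/* Queue one record of 'need' bytes, flushing first if it would not fit. */
static int queue_record(struct batch *b, size_t need)
{
	int err;

	if (need > BATCH_MAX)
		return -1;	/* can never fit, even in an empty batch */
	if (b->length + need > BATCH_MAX) {
		err = flush_batch(b);
		if (err)
			return err;
	}
	b->length += need;
	return 0;
}
```

Queuing five 30-byte records with a 64-byte cap triggers two flushes and leaves one record pending in the current batch.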
/*
* If an MDS fails and recovers, clients need to reconnect in order to
* reestablish shared state. This includes all caps issued through
* this session _and_ the snap_realm hierarchy. Because it's not
* clear which snap realms the mds cares about, we send everything we
* know about.. that ensures we'll then get any new info the
* recovering MDS might have.
*
* This is a relatively heavyweight operation, but it's rare.
*/
2010-05-10 16:31:25 -07:00
static void send_mds_reconnect ( struct ceph_mds_client * mdsc ,
struct ceph_mds_session * session )
2009-10-06 11:31:09 -07:00
{
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = mdsc - > fsc - > client ;
2009-10-06 11:31:09 -07:00
struct ceph_msg * reply ;
2010-05-10 16:31:25 -07:00
int mds = session - > s_mds ;
2010-05-10 21:58:38 -07:00
int err = - ENOMEM ;
2019-01-01 16:28:33 +08:00
struct ceph_reconnect_state recon_state = {
. session = session ,
} ;
2017-10-19 08:53:58 -04:00
LIST_HEAD ( dispose ) ;
2009-10-06 11:31:09 -07:00
2023-06-12 09:04:07 +08:00
pr_info_client ( cl , " mds%d reconnect start \n " , mds ) ;
2009-10-06 11:31:09 -07:00
2019-01-01 16:28:33 +08:00
recon_state . pagelist = ceph_pagelist_alloc ( GFP_NOFS ) ;
if ( ! recon_state . pagelist )
2009-12-23 12:21:51 -08:00
goto fail_nopagelist ;
2018-10-15 17:38:23 +02:00
reply = ceph_msg_new2 ( CEPH_MSG_CLIENT_RECONNECT , 0 , 1 , GFP_NOFS , false ) ;
2010-04-01 16:06:19 -07:00
if ( ! reply )
2009-12-23 12:21:51 -08:00
goto fail_nomsg ;
2019-11-15 11:51:55 -05:00
xa_destroy ( & session - > s_delegated_inos ) ;
2010-05-10 16:31:25 -07:00
mutex_lock ( & session - > s_mutex ) ;
session - > s_state = CEPH_MDS_SESSION_RECONNECTING ;
session - > s_seq = 0 ;
2009-10-06 11:31:09 -07:00
2023-06-12 09:04:07 +08:00
doutc ( cl , " session %p state %s \n " , session ,
ceph_session_state_name ( session - > s_state ) ) ;
2009-10-06 11:31:09 -07:00
2021-06-04 12:03:09 -04:00
atomic_inc ( & session - > s_cap_gen ) ;
2013-09-22 11:08:14 +08:00
spin_lock ( & session - > s_cap_lock ) ;
2015-01-05 11:04:04 +08:00
/* don't know if session is readonly */
session - > s_readonly = 0 ;
2013-09-22 11:08:14 +08:00
/*
* notify __ceph_remove_cap() that we are composing cap reconnect.
* If a cap gets released before being added to the cap reconnect,
* __ceph_remove_cap() should skip queuing cap release.
*/
session - > s_cap_reconnect = 1 ;
2010-05-10 15:36:44 -07:00
/* drop old cap expires; we're about to reestablish that state */
2017-10-19 08:53:58 -04:00
detach_cap_releases ( session , & dispose ) ;
spin_unlock ( & session - > s_cap_lock ) ;
dispose_cap_releases ( mdsc , & dispose ) ;
2010-05-10 15:36:44 -07:00
2014-09-10 16:56:23 +08:00
/* trim unused caps to reduce MDS's cache rejoin time */
2015-04-07 15:51:08 +08:00
if ( mdsc - > fsc - > sb - > s_root )
shrink_dcache_parent ( mdsc - > fsc - > sb - > s_root ) ;
2014-09-10 16:56:23 +08:00
ceph_con_close ( & session - > s_con ) ;
ceph_con_open ( & session - > s_con ,
CEPH_ENTITY_TYPE_MDS , mds ,
ceph_mdsmap_get_addr ( mdsc - > mdsmap , mds ) ) ;
/* replay unsafe requests */
replay_unsafe_requests ( mdsc , session ) ;
2019-01-01 16:28:33 +08:00
ceph_early_kick_flushing_caps ( mdsc , session ) ;
2014-09-10 16:56:23 +08:00
down_read ( & mdsc - > snap_rwsem ) ;
2019-01-01 16:28:33 +08:00
/* placeholder for nr_caps */
err = ceph_pagelist_encode_32 ( recon_state . pagelist , 0 ) ;
2009-12-23 12:21:51 -08:00
if ( err )
goto fail ;
2010-05-12 15:21:32 -07:00
2019-01-01 16:28:33 +08:00
if ( test_bit ( CEPHFS_FEATURE_MULTI_RECONNECT , & session - > s_features ) ) {
2016-07-04 22:05:18 +08:00
recon_state . msg_version = 3 ;
2019-01-01 16:28:33 +08:00
recon_state . allow_multi = true ;
} else if ( session - > s_con . peer_features & CEPH_FEATURE_MDSENC ) {
recon_state . msg_version = 3 ;
} else {
2018-11-08 14:55:21 +01:00
recon_state . msg_version = 2 ;
2019-01-01 16:28:33 +08:00
}
/* traverse this session's caps */
2020-02-18 14:12:45 -05:00
err = ceph_iterate_session_caps ( session , reconnect_caps_cb , & recon_state ) ;
2009-10-06 11:31:09 -07:00
2013-09-22 11:08:14 +08:00
spin_lock ( & session - > s_cap_lock ) ;
session - > s_cap_reconnect = 0 ;
spin_unlock ( & session - > s_cap_lock ) ;
2019-01-01 16:28:33 +08:00
if ( err < 0 )
goto fail ;
2009-10-06 11:31:09 -07:00
2019-01-01 16:28:33 +08:00
/* check if all realms can be encoded into current message */
if ( mdsc - > num_snap_realms ) {
size_t total_len =
recon_state . pagelist - > length +
mdsc - > num_snap_realms *
sizeof ( struct ceph_mds_snaprealm_reconnect ) ;
if ( recon_state . msg_version > = 4 ) {
/* number of realms */
total_len + = sizeof ( u32 ) ;
/* version, compat_version and struct_len */
total_len + = mdsc - > num_snap_realms *
( 2 * sizeof ( u8 ) + sizeof ( u32 ) ) ;
}
if ( total_len > RECONNECT_MAX_SIZE ) {
if ( ! recon_state . allow_multi ) {
err = - ENOSPC ;
goto fail ;
}
if ( recon_state . nr_caps ) {
err = send_reconnect_partial ( & recon_state ) ;
if ( err )
goto fail ;
}
recon_state . msg_version = 5 ;
}
2009-10-06 11:31:09 -07:00
}
2019-01-01 16:28:33 +08:00
err = encode_snap_realms ( mdsc , & recon_state ) ;
if ( err < 0 )
goto fail ;
if ( recon_state . msg_version > = 5 ) {
err = ceph_pagelist_encode_8 ( recon_state . pagelist , 0 ) ;
if ( err < 0 )
goto fail ;
}
2013-09-22 10:28:10 +08:00
2019-01-01 16:28:33 +08:00
if ( recon_state . nr_caps | | recon_state . nr_realms ) {
struct page * page =
list_first_entry ( & recon_state . pagelist - > head ,
struct page , lru ) ;
2013-09-22 10:28:10 +08:00
__le32 * addr = kmap_atomic ( page ) ;
2019-01-01 16:28:33 +08:00
if ( recon_state . nr_caps ) {
WARN_ON ( recon_state . nr_realms ! = mdsc - > num_snap_realms ) ;
* addr = cpu_to_le32 ( recon_state . nr_caps ) ;
} else if ( recon_state . msg_version > = 4 ) {
* ( addr + 1 ) = cpu_to_le32 ( recon_state . nr_realms ) ;
}
2013-09-22 10:28:10 +08:00
kunmap_atomic ( addr ) ;
2013-03-04 22:29:57 -06:00
}
2013-09-22 10:28:10 +08:00
2019-01-01 16:28:33 +08:00
reply - > hdr . version = cpu_to_le16 ( recon_state . msg_version ) ;
if ( recon_state . msg_version > = 4 )
reply - > hdr . compat_version = cpu_to_le16 ( 4 ) ;
2015-06-10 15:17:56 +08:00
2019-01-01 16:28:33 +08:00
reply - > hdr . data_len = cpu_to_le32 ( recon_state . pagelist - > length ) ;
ceph_msg_data_add_pagelist ( reply , recon_state . pagelist ) ;
2015-06-10 15:17:56 +08:00
2009-10-06 11:31:09 -07:00
ceph_con_send ( & session - > s_con , reply ) ;
2010-05-10 21:58:38 -07:00
mutex_unlock ( & session - > s_mutex ) ;
mutex_lock ( & mdsc - > mutex ) ;
__wake_requests ( mdsc , & session - > s_waiting ) ;
mutex_unlock ( & mdsc - > mutex ) ;
2009-10-06 11:31:09 -07:00
up_read ( & mdsc - > snap_rwsem ) ;
2019-01-01 16:28:33 +08:00
ceph_pagelist_release ( recon_state . pagelist ) ;
2009-10-06 11:31:09 -07:00
return ;
2009-12-23 12:21:51 -08:00
fail :
2009-10-06 11:31:09 -07:00
ceph_msg_put ( reply ) ;
2010-05-10 21:58:38 -07:00
up_read ( & mdsc - > snap_rwsem ) ;
mutex_unlock ( & session - > s_mutex ) ;
2009-12-23 12:21:51 -08:00
fail_nomsg :
2019-01-01 16:28:33 +08:00
ceph_pagelist_release ( recon_state . pagelist ) ;
2009-12-23 12:21:51 -08:00
fail_nopagelist :
2023-06-12 09:04:07 +08:00
pr_err_client ( cl , " error %d preparing reconnect for mds%d \n " ,
err , mds ) ;
2010-05-10 21:58:38 -07:00
return ;
2009-10-06 11:31:09 -07:00
}
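send_mds_reconnect() writes a zero placeholder for nr_caps before it knows how many caps will be encoded, then patches the real count into the first page once iteration has finished. The same back-patching technique is sketched below on a flat buffer. `put_le32`, `get_le32` and `encode_counted` are hypothetical helpers, and the one-byte records are purely illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit value in little-endian byte order, whatever the host is. */
static void put_le32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)v;
	p[1] = (uint8_t)(v >> 8);
	p[2] = (uint8_t)(v >> 16);
	p[3] = (uint8_t)(v >> 24);
}

/* Load a little-endian 32-bit value. */
static uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/*
 * Encode up to 'n' one-byte records behind a count header. Returns the
 * number of bytes written, or 0 if buf is too small for the header.
 */
static size_t encode_counted(uint8_t *buf, size_t cap,
			     const uint8_t *recs, size_t n)
{
	size_t off = sizeof(uint32_t);
	uint32_t count = 0;
	size_t i;

	if (cap < off)
		return 0;
	put_le32(buf, 0);		/* placeholder for the count */
	for (i = 0; i < n && off < cap; i++) {
		buf[off++] = recs[i];
		count++;
	}
	put_le32(buf, count);		/* back-patch the real count */
	return off;
}
```

The header therefore always reflects the number of records that were actually written, even when the buffer fills up before all of them fit.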
/*
* compare old and new mdsmaps, kicking requests
* and closing out old connections as necessary
*
* called under mdsc->mutex.
*/
static void check_new_map ( struct ceph_mds_client * mdsc ,
struct ceph_mdsmap * newmap ,
struct ceph_mdsmap * oldmap )
{
2021-08-18 09:31:19 +08:00
int i , j , err ;
2009-10-06 11:31:09 -07:00
int oldstate , newstate ;
struct ceph_mds_session * s ;
2021-08-18 09:31:19 +08:00
unsigned long targets [ DIV_ROUND_UP ( CEPH_MAX_MDS , sizeof ( unsigned long ) ) ] = { 0 } ;
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = mdsc - > fsc - > client ;
2009-10-06 11:31:09 -07:00
2023-06-12 09:04:07 +08:00
doutc ( cl , " new %u old %u \n " , newmap - > m_epoch , oldmap - > m_epoch ) ;
2009-10-06 11:31:09 -07:00
2021-08-18 09:31:19 +08:00
if ( newmap - > m_info ) {
for ( i = 0 ; i < newmap - > possible_max_rank ; i + + ) {
for ( j = 0 ; j < newmap - > m_info [ i ] . num_export_targets ; j + + )
set_bit ( newmap - > m_info [ i ] . export_targets [ j ] , targets ) ;
}
}
2019-12-04 06:57:39 -05:00
for ( i = 0 ; i < oldmap - > possible_max_rank & & i < mdsc - > max_sessions ; i + + ) {
2017-08-20 20:22:02 +02:00
if ( ! mdsc - > sessions [ i ] )
2009-10-06 11:31:09 -07:00
continue ;
s = mdsc - > sessions [ i ] ;
oldstate = ceph_mdsmap_get_state ( oldmap , i ) ;
newstate = ceph_mdsmap_get_state ( newmap , i ) ;
2023-06-12 09:04:07 +08:00
doutc ( cl , " mds%d state %s%s -> %s%s (session %s) \n " ,
i , ceph_mds_state_name ( oldstate ) ,
ceph_mdsmap_is_laggy ( oldmap , i ) ? " (laggy) " : " " ,
ceph_mds_state_name ( newstate ) ,
ceph_mdsmap_is_laggy ( newmap , i ) ? " (laggy) " : " " ,
ceph_session_state_name ( s - > s_state ) ) ;
2009-10-06 11:31:09 -07:00
2019-12-04 06:57:39 -05:00
if ( i > = newmap - > possible_max_rank ) {
2019-06-10 15:45:09 +08:00
/* force close session for stopped mds */
2019-12-19 19:44:09 -05:00
ceph_get_mds_session ( s ) ;
2019-06-10 15:45:09 +08:00
__unregister_session ( mdsc , s ) ;
__wake_requests ( mdsc , & s - > s_waiting ) ;
mutex_unlock ( & mdsc - > mutex ) ;
2017-03-28 17:56:29 +08:00
2019-06-10 15:45:09 +08:00
mutex_lock ( & s - > s_mutex ) ;
cleanup_session_requests ( mdsc , s ) ;
remove_session_caps ( s ) ;
mutex_unlock ( & s - > s_mutex ) ;
2017-03-28 17:56:29 +08:00
2019-06-10 15:45:09 +08:00
ceph_put_mds_session ( s ) ;
2017-03-28 17:56:29 +08:00
2019-06-10 15:45:09 +08:00
mutex_lock ( & mdsc - > mutex ) ;
kick_requests ( mdsc , i ) ;
continue ;
}
if ( memcmp ( ceph_mdsmap_get_addr ( oldmap , i ) ,
ceph_mdsmap_get_addr ( newmap , i ) ,
sizeof ( struct ceph_entity_addr ) ) ) {
/* just close it */
mutex_unlock ( & mdsc - > mutex ) ;
mutex_lock ( & s - > s_mutex ) ;
mutex_lock ( & mdsc - > mutex ) ;
ceph_con_close ( & s - > s_con ) ;
mutex_unlock ( & s - > s_mutex ) ;
s - > s_state = CEPH_MDS_SESSION_RESTARTING ;
2009-10-06 11:31:09 -07:00
} else if ( oldstate = = newstate ) {
continue ; /* nothing new with this mds */
}
/*
* send reconnect ?
*/
if ( s - > s_state = = CEPH_MDS_SESSION_RESTARTING & &
2010-05-10 16:31:25 -07:00
newstate > = CEPH_MDS_STATE_RECONNECT ) {
mutex_unlock ( & mdsc - > mutex ) ;
2021-08-18 09:31:19 +08:00
clear_bit ( i , targets ) ;
2010-05-10 16:31:25 -07:00
send_mds_reconnect ( mdsc , s ) ;
mutex_lock ( & mdsc - > mutex ) ;
}
2009-10-06 11:31:09 -07:00
/*
2010-03-18 14:45:05 -07:00
* kick requests on any mds that has gone active.
2009-10-06 11:31:09 -07:00
*/
if ( oldstate < CEPH_MDS_STATE_ACTIVE & &
newstate > = CEPH_MDS_STATE_ACTIVE ) {
2010-03-18 14:45:05 -07:00
if ( oldstate ! = CEPH_MDS_STATE_CREATING & &
oldstate ! = CEPH_MDS_STATE_STARTING )
2023-06-12 09:04:07 +08:00
pr_info_client ( cl , " mds%d recovery completed \n " ,
s - > s_mds ) ;
2010-03-18 14:45:05 -07:00
kick_requests ( mdsc , i ) ;
2020-05-20 03:51:19 -04:00
mutex_unlock ( & mdsc - > mutex ) ;
2020-04-03 13:09:07 -04:00
mutex_lock ( & s - > s_mutex ) ;
2020-05-20 03:51:19 -04:00
mutex_lock ( & mdsc - > mutex ) ;
2009-10-06 11:31:09 -07:00
ceph_kick_flushing_caps ( mdsc , s ) ;
2020-04-03 13:09:07 -04:00
mutex_unlock ( & s - > s_mutex ) ;
2018-12-10 16:35:09 +08:00
wake_up_session_caps ( s , RECONNECT ) ;
2009-10-06 11:31:09 -07:00
}
}
2010-06-21 13:38:35 -07:00
2021-08-18 09:31:19 +08:00
/*
* Only open and reconnect sessions that don't exist yet.
*/
for ( i = 0 ; i < newmap - > possible_max_rank ; i + + ) {
/*
* The importing MDS may crash just after the
* EImportStart journal is flushed. When a standby
* MDS takes over and replays the EImportStart
* journal, the new MDS daemon waits for the client
* to reconnect to it, but the client may never have
* registered or opened the session.
*
* Try to reconnect to that MDS daemon if its rank
* number is in the export targets array and it is
* in the up:reconnect state.
*/
newstate = ceph_mdsmap_get_state ( newmap , i ) ;
if ( ! test_bit ( i , targets ) | | newstate ! = CEPH_MDS_STATE_RECONNECT )
continue ;
/*
* In rare cases the session may already have been
* registered and opened by requests that chose a
* random MDS during the mdsc->mutex unlock/lock gap
* below. The MDS daemon will just queue those
* requests and keep waiting for the client's
* reconnect request in the up:reconnect state.
*/
s = __ceph_lookup_mds_session ( mdsc , i ) ;
if ( likely ( ! s ) ) {
s = __open_export_target_session ( mdsc , i ) ;
if ( IS_ERR ( s ) ) {
err = PTR_ERR ( s ) ;
2023-06-12 09:04:07 +08:00
pr_err_client ( cl ,
" failed to open export target session, err %d \n " ,
err ) ;
2021-08-18 09:31:19 +08:00
continue ;
}
}
2023-06-12 09:04:07 +08:00
doutc ( cl , " send reconnect to export target mds.%d \n " , i ) ;
2021-08-18 09:31:19 +08:00
mutex_unlock ( & mdsc - > mutex ) ;
send_mds_reconnect ( mdsc , s ) ;
ceph_put_mds_session ( s ) ;
mutex_lock ( & mdsc - > mutex ) ;
}
2019-12-04 06:57:39 -05:00
for ( i = 0 ; i < newmap - > possible_max_rank & & i < mdsc - > max_sessions ; i + + ) {
2010-06-21 13:38:35 -07:00
s = mdsc - > sessions [ i ] ;
if ( ! s )
continue ;
if ( ! ceph_mdsmap_is_laggy ( newmap , i ) )
continue ;
if ( s - > s_state = = CEPH_MDS_SESSION_OPEN | |
s - > s_state = = CEPH_MDS_SESSION_HUNG | |
s - > s_state = = CEPH_MDS_SESSION_CLOSING ) {
2023-06-12 09:04:07 +08:00
doutc ( cl , " connecting to export targets of laggy mds%d \n " , i ) ;
2010-06-21 13:38:35 -07:00
__open_export_target_sessions ( mdsc , s ) ;
}
}
2009-10-06 11:31:09 -07:00
}
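check_new_map() records export-target ranks in a small bitmap, clears the bit for any rank it has already reconnected to, and later tests the remaining bits. The bit operations involved can be sketched in userspace as below. These are non-atomic analogues written for illustration: `MAX_RANKS`, `bm_set`, `bm_clear` and `bm_test` are hypothetical names, and the bound of 256 is an assumption rather than the value of CEPH_MAX_MDS.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define MAX_RANKS	256	/* hypothetical bound, standing in for CEPH_MAX_MDS */
#define NWORDS		((MAX_RANKS + WORD_BITS - 1) / WORD_BITS)

/* Non-atomic userspace analogue of set_bit(). */
static void bm_set(unsigned long *map, size_t bit)
{
	assert(bit < MAX_RANKS);
	map[bit / WORD_BITS] |= 1UL << (bit % WORD_BITS);
}

/* Non-atomic userspace analogue of clear_bit(). */
static void bm_clear(unsigned long *map, size_t bit)
{
	assert(bit < MAX_RANKS);
	map[bit / WORD_BITS] &= ~(1UL << (bit % WORD_BITS));
}

/* Userspace analogue of test_bit(): returns 1 if the bit is set, else 0. */
static int bm_test(const unsigned long *map, size_t bit)
{
	assert(bit < MAX_RANKS);
	return (int)((map[bit / WORD_BITS] >> (bit % WORD_BITS)) & 1UL);
}
```

A caller would declare `unsigned long targets[NWORDS] = {0};`, set a bit per export target, and clear the bit for each rank that has been handled.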
/*
* leases
*/
/*
* caller must hold session s_mutex, dentry->d_lock
*/
void __ceph_mdsc_drop_dentry_lease ( struct dentry * dentry )
{
struct ceph_dentry_info * di = ceph_dentry ( dentry ) ;
ceph_put_mds_session ( di - > lease_session ) ;
di - > lease_session = NULL ;
}
2010-02-22 15:12:16 -08:00
static void handle_lease ( struct ceph_mds_client * mdsc ,
struct ceph_mds_session * session ,
struct ceph_msg * msg )
2009-10-06 11:31:09 -07:00
{
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = mdsc - > fsc - > client ;
2010-04-06 15:14:15 -07:00
struct super_block * sb = mdsc - > fsc - > sb ;
2009-10-06 11:31:09 -07:00
struct inode * inode ;
struct dentry * parent , * dentry ;
struct ceph_dentry_info * di ;
2010-02-22 15:12:16 -08:00
int mds = session - > s_mds ;
2009-10-06 11:31:09 -07:00
struct ceph_mds_lease * h = msg - > front . iov_base ;
2010-06-04 10:05:40 -07:00
u32 seq ;
2009-10-06 11:31:09 -07:00
struct ceph_vino vino ;
struct qstr dname ;
int release = 0 ;
2023-06-12 09:04:07 +08:00
doutc ( cl , " from mds%d \n " , mds ) ;
2009-10-06 11:31:09 -07:00
2022-12-21 14:13:51 +08:00
if ( ! ceph_inc_mds_stopping_blocker ( mdsc , session ) )
return ;
2009-10-06 11:31:09 -07:00
/* decode */
if ( msg - > front . iov_len < sizeof ( * h ) + sizeof ( u32 ) )
goto bad ;
vino . ino = le64_to_cpu ( h - > ino ) ;
vino . snap = CEPH_NOSNAP ;
2010-06-04 10:05:40 -07:00
seq = le32_to_cpu ( h - > seq ) ;
2018-08-03 16:24:49 +08:00
dname . len = get_unaligned_le32 ( h + 1 ) ;
if ( msg - > front . iov_len < sizeof ( * h ) + sizeof ( u32 ) + dname . len )
2009-10-06 11:31:09 -07:00
goto bad ;
2018-08-03 16:24:49 +08:00
dname . name = ( void * ) ( h + 1 ) + sizeof ( u32 ) ;
2009-10-06 11:31:09 -07:00
/* lookup inode */
inode = ceph_find_inode ( sb , vino ) ;
2023-06-12 09:04:07 +08:00
doutc ( cl , " %s, ino %llx %p %.*s \n " , ceph_lease_op_name ( h - > action ) ,
vino . ino , inode , dname . len , dname . name ) ;
2014-09-17 07:45:12 +08:00
mutex_lock ( & session - > s_mutex ) ;
2017-08-20 20:22:02 +02:00
if ( ! inode ) {
2023-06-12 09:04:07 +08:00
doutc ( cl , " no inode %llx \n " , vino . ino ) ;
2009-10-06 11:31:09 -07:00
goto release ;
}
/* dentry */
parent = d_find_alias ( inode ) ;
if ( ! parent ) {
2023-06-12 09:04:07 +08:00
doutc ( cl , " no parent dentry on inode %p \n " , inode ) ;
2009-10-06 11:31:09 -07:00
WARN_ON ( 1 ) ;
goto release ; /* hrm... */
}
2016-06-10 07:51:30 -07:00
dname . hash = full_name_hash ( parent , dname . name , dname . len ) ;
2009-10-06 11:31:09 -07:00
dentry = d_lookup ( parent , & dname ) ;
dput ( parent ) ;
if ( ! dentry )
goto release ;
spin_lock ( & dentry - > d_lock ) ;
di = ceph_dentry ( dentry ) ;
switch ( h - > action ) {
case CEPH_MDS_LEASE_REVOKE :
2011-11-11 09:48:53 -08:00
if ( di - > lease_session = = session ) {
2010-06-04 10:05:40 -07:00
if ( ceph_seq_cmp ( di - > lease_seq , seq ) > 0 )
h - > seq = cpu_to_le32 ( di - > lease_seq ) ;
2009-10-06 11:31:09 -07:00
__ceph_mdsc_drop_dentry_lease ( dentry ) ;
}
release = 1 ;
break ;
case CEPH_MDS_LEASE_RENEW :
2011-11-11 09:48:53 -08:00
if ( di - > lease_session = = session & &
2021-06-04 12:03:09 -04:00
di - > lease_gen = = atomic_read ( & session - > s_cap_gen ) & &
2009-10-06 11:31:09 -07:00
di - > lease_renew_from & &
di - > lease_renew_after = = 0 ) {
unsigned long duration =
2015-02-06 06:52:17 -05:00
msecs_to_jiffies ( le32_to_cpu ( h - > duration_ms ) ) ;
2009-10-06 11:31:09 -07:00
2010-06-04 10:05:40 -07:00
di - > lease_seq = seq ;
2016-06-22 16:35:04 +02:00
di - > time = di - > lease_renew_from + duration ;
2009-10-06 11:31:09 -07:00
di - > lease_renew_after = di - > lease_renew_from +
( duration > > 1 ) ;
di - > lease_renew_from = 0 ;
}
break ;
}
spin_unlock ( & dentry - > d_lock ) ;
dput ( dentry ) ;
if ( ! release )
goto out ;
release :
/* let's just reuse the same message */
h - > action = CEPH_MDS_LEASE_REVOKE_ACK ;
ceph_msg_get ( msg ) ;
ceph_con_send ( & session - > s_con , msg ) ;
out :
mutex_unlock ( & session - > s_mutex ) ;
2021-06-04 12:03:09 -04:00
iput ( inode ) ;
2022-12-21 14:13:51 +08:00
ceph_dec_mds_stopping_blocker ( mdsc ) ;
2009-10-06 11:31:09 -07:00
return ;
bad :
2022-12-21 14:13:51 +08:00
ceph_dec_mds_stopping_blocker ( mdsc ) ;
2023-06-12 09:04:07 +08:00
pr_err_client ( cl , " corrupt lease message \n " ) ;
2009-12-14 15:13:47 -08:00
ceph_msg_dump ( msg ) ;
2009-10-06 11:31:09 -07:00
}
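The decode step in handle_lease() checks the message length twice: once before reading the 32-bit name length, and again once that length is known. A minimal sketch of the same two-stage bounds check on a raw buffer follows. `struct name_ref`, `read_le32` and `decode_name` are hypothetical names, and the header is treated as an opaque run of `hdr_len` bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct name_ref {
	const uint8_t *name;	/* points into the message, not NUL-terminated */
	uint32_t len;
};

/* Load a little-endian 32-bit value from a possibly unaligned address. */
static uint32_t read_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/*
 * Decode "fixed header | le32 name length | name bytes". Returns 0 on
 * success, -1 if the message is too short for what it claims to hold.
 * The comparisons subtract from msg_len rather than adding to the
 * claimed length, so a huge length field cannot wrap the check.
 */
static int decode_name(const uint8_t *msg, size_t msg_len, size_t hdr_len,
		       struct name_ref *out)
{
	if (msg_len < hdr_len || msg_len - hdr_len < sizeof(uint32_t))
		return -1;	/* no room for the length field */
	out->len = read_le32(msg + hdr_len);
	if (msg_len - hdr_len - sizeof(uint32_t) < out->len)
		return -1;	/* claimed name runs past the end */
	out->name = msg + hdr_len + sizeof(uint32_t);
	return 0;
}
```

Only after both checks pass is the name pointer handed back to the caller.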
void ceph_mdsc_lease_send_msg ( struct ceph_mds_session * session ,
struct dentry * dentry , char action ,
u32 seq )
{
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = session - > s_mdsc - > fsc - > client ;
2009-10-06 11:31:09 -07:00
struct ceph_msg * msg ;
struct ceph_mds_lease * lease ;
2019-05-23 10:45:24 +08:00
struct inode * dir ;
int len = sizeof ( * lease ) + sizeof ( u32 ) + NAME_MAX ;
2009-10-06 11:31:09 -07:00
2023-06-12 09:04:07 +08:00
doutc ( cl , " dentry %p %s to mds%d \n " , dentry , ceph_lease_op_name ( action ) ,
session - > s_mds ) ;
2009-10-06 11:31:09 -07:00
2011-08-09 15:03:46 -07:00
msg = ceph_msg_new ( CEPH_MSG_CLIENT_LEASE , len , GFP_NOFS , false ) ;
2010-04-01 16:06:19 -07:00
if ( ! msg )
2009-10-06 11:31:09 -07:00
return ;
lease = msg - > front . iov_base ;
lease - > action = action ;
lease - > seq = cpu_to_le32 ( seq ) ;
2019-05-23 10:45:24 +08:00
spin_lock ( & dentry - > d_lock ) ;
dir = d_inode ( dentry - > d_parent ) ;
lease - > ino = cpu_to_le64 ( ceph_ino ( dir ) ) ;
lease - > first = lease - > last = cpu_to_le64 ( ceph_snap ( dir ) ) ;
put_unaligned_le32 ( dentry - > d_name . len , lease + 1 ) ;
memcpy ( ( void * ) ( lease + 1 ) + 4 ,
dentry - > d_name . name , dentry - > d_name . len ) ;
spin_unlock ( & dentry - > d_lock ) ;
2009-10-06 11:31:09 -07:00
ceph_con_send ( & session - > s_con , msg ) ;
}
/*
2021-07-05 09:22:55 +08:00
* lock and unlock the session, to wait for ongoing session activities
2009-10-06 11:31:09 -07:00
*/
2021-07-05 09:22:55 +08:00
static void lock_unlock_session ( struct ceph_mds_session * s )
2009-10-06 11:31:09 -07:00
{
2021-07-05 09:22:55 +08:00
mutex_lock ( & s - > s_mutex ) ;
mutex_unlock ( & s - > s_mutex ) ;
2009-10-06 11:31:09 -07:00
}
2019-07-25 20:16:47 +08:00
static void maybe_recover_session ( struct ceph_mds_client * mdsc )
{
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = mdsc - > fsc - > client ;
2019-07-25 20:16:47 +08:00
struct ceph_fs_client * fsc = mdsc - > fsc ;
if ( ! ceph_test_mount_opt ( fsc , CLEANRECOVER ) )
return ;
2009-10-06 11:31:09 -07:00
2019-07-25 20:16:47 +08:00
if ( READ_ONCE ( fsc - > mount_state ) ! = CEPH_MOUNT_MOUNTED )
return ;
2020-09-14 13:39:19 +02:00
if ( ! READ_ONCE ( fsc - > blocklisted ) )
2019-07-25 20:16:47 +08:00
return ;
2023-06-12 09:04:07 +08:00
pr_info_client ( cl , " auto reconnect after blocklisted \n " ) ;
2019-07-25 20:16:47 +08:00
ceph_force_reconnect ( fsc - > sb ) ;
}
2009-10-06 11:31:09 -07:00
2020-06-30 03:52:15 -04:00
bool check_session_state ( struct ceph_mds_session * s )
{
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = s - > s_mdsc - > fsc - > client ;
2020-10-12 09:39:06 -04:00
switch ( s - > s_state ) {
case CEPH_MDS_SESSION_OPEN :
if ( s - > s_ttl & & time_after ( jiffies , s - > s_ttl ) ) {
2020-06-30 03:52:15 -04:00
s - > s_state = CEPH_MDS_SESSION_HUNG ;
2023-06-12 09:04:07 +08:00
pr_info_client ( cl , " mds%d hung \n " , s - > s_mds ) ;
2020-06-30 03:52:15 -04:00
}
2020-10-12 09:39:06 -04:00
break ;
case CEPH_MDS_SESSION_CLOSING :
case CEPH_MDS_SESSION_NEW :
case CEPH_MDS_SESSION_RESTARTING :
case CEPH_MDS_SESSION_CLOSED :
case CEPH_MDS_SESSION_REJECTED :
2020-06-30 03:52:15 -04:00
return false ;
2020-10-12 09:39:06 -04:00
}
2020-06-30 03:52:15 -04:00
return true ;
}
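The hung-session test above relies on `time_after()`, which stays correct when the jiffies counter wraps around. A userspace sketch of that comparison is shown below. `tick_after` and `session_expired` are hypothetical names, and the cast assumes the usual two's-complement conversion from `unsigned long` to `long`.

```c
#include <assert.h>

/*
 * Userspace analogue of time_after(a, b): true if tick a is later than
 * tick b. The unsigned subtraction wraps, and the sign of the result
 * tells which value is ahead, provided the two are less than half the
 * counter range apart.
 */
static int tick_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* A ttl of 0 means "no deadline", as in the s_ttl check above. */
static int session_expired(unsigned long now, unsigned long ttl)
{
	return ttl && tick_after(now, ttl);
}
```

A plain `now > ttl` would give the wrong answer for a deadline set just before the counter wraps and checked just after.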
2020-10-12 09:39:06 -04:00
/*
* If the sequence is incremented while we're waiting on a REQUEST_CLOSE reply,
* then we need to retransmit that request.
*/
void inc_session_sequence ( struct ceph_mds_session * s )
{
2023-06-12 09:04:07 +08:00
struct ceph_client * cl = s - > s_mdsc - > fsc - > client ;
2020-10-12 09:39:06 -04:00
lockdep_assert_held ( & s - > s_mutex ) ;
s - > s_seq + + ;
if ( s - > s_state = = CEPH_MDS_SESSION_CLOSING ) {
int ret ;
2023-06-12 09:04:07 +08:00
doutc ( cl , " resending session close request for mds%d \n " , s - > s_mds ) ;
2020-10-12 09:39:06 -04:00
ret = request_close_session ( s ) ;
if ( ret < 0 )
2023-06-12 09:04:07 +08:00
pr_err_client ( cl , " unable to close session to mds%d: %d \n " ,
s - > s_mds , ret ) ;
2020-10-12 09:39:06 -04:00
}
}
2009-10-06 11:31:09 -07:00
/*
2021-07-06 14:52:41 +01:00
* delayed work -- periodically trim expired leases, renew caps with mds. If
* the @delay parameter is set to 0 or if it's more than 5 secs, the default
* workqueue delay value of 5 secs will be used.
2009-10-06 11:31:09 -07:00
*/
2021-07-06 14:52:41 +01:00
static void schedule_delayed ( struct ceph_mds_client * mdsc , unsigned long delay )
2009-10-06 11:31:09 -07:00
{
2021-07-06 14:52:41 +01:00
unsigned long max_delay = HZ * 5 ;
/* 5 secs default delay */
if ( ! delay | | ( delay > max_delay ) )
delay = max_delay ;
schedule_delayed_work ( & mdsc - > delayed_work ,
round_jiffies_relative ( delay ) ) ;
2009-10-06 11:31:09 -07:00
}
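The delay selection in schedule_delayed() can be exercised on its own. The sketch below reproduces only the clamping rule, with the hypothetical constant `TICKS_PER_SEC` standing in for HZ and `clamp_delay` as an illustrative name; it does not schedule anything.

```c
#include <assert.h>

#define TICKS_PER_SEC 100UL	/* hypothetical tick rate, standing in for HZ */

/*
 * Return the delay to use: the requested value if it lies in
 * (0, 5 seconds], otherwise the 5 second default.
 */
static unsigned long clamp_delay(unsigned long delay)
{
	unsigned long max_delay = TICKS_PER_SEC * 5;

	if (!delay || delay > max_delay)
		delay = max_delay;
	return delay;
}
```

A delay of 0 is treated as "no preference", so the caller can pass the result of a scan that found nothing pending without special-casing it.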
static void delayed_work ( struct work_struct * work )
{
struct ceph_mds_client * mdsc =
container_of ( work , struct ceph_mds_client , delayed_work . work ) ;
2021-07-06 14:52:41 +01:00
unsigned long delay ;
2009-10-06 11:31:09 -07:00
int renew_interval ;
int renew_caps ;
2021-07-06 14:52:41 +01:00
int i ;
2009-10-06 11:31:09 -07:00
2023-06-12 09:04:07 +08:00
doutc ( mdsc - > fsc - > client , " mdsc delayed_work \n " ) ;
2017-12-14 15:11:09 +08:00
2023-07-25 12:03:59 +08:00
if ( mdsc - > stopping > = CEPH_MDSC_STOPPING_FLUSHED )
2020-07-01 01:52:48 -04:00
return ;
2009-10-06 11:31:09 -07:00
mutex_lock ( & mdsc - > mutex ) ;
renew_interval = mdsc - > mdsmap - > m_session_timeout > > 2 ;
renew_caps = time_after_eq ( jiffies , HZ * renew_interval +
mdsc - > last_renew_caps ) ;
if ( renew_caps )
mdsc - > last_renew_caps = jiffies ;
for ( i = 0 ; i < mdsc - > max_sessions ; i + + ) {
struct ceph_mds_session * s = __ceph_lookup_mds_session ( mdsc , i ) ;
2017-08-20 20:22:02 +02:00
if ( ! s )
2009-10-06 11:31:09 -07:00
continue ;
2020-06-30 03:52:15 -04:00
if ( ! check_session_state ( s ) ) {
2009-10-06 11:31:09 -07:00
ceph_put_mds_session ( s ) ;
continue ;
}
mutex_unlock ( & mdsc - > mutex ) ;
mutex_lock ( & s - > s_mutex ) ;
if ( renew_caps )
send_renew_caps ( mdsc , s ) ;
else
ceph_con_keepalive ( & s - > s_con ) ;
2010-03-17 16:30:21 -07:00
if ( s - > s_state = = CEPH_MDS_SESSION_OPEN | |
s - > s_state = = CEPH_MDS_SESSION_HUNG )
2010-06-09 16:47:10 -07:00
ceph_send_cap_releases ( mdsc , s ) ;
2009-10-06 11:31:09 -07:00
mutex_unlock ( & s - > s_mutex ) ;
ceph_put_mds_session ( s ) ;
mutex_lock ( & mdsc - > mutex ) ;
}
mutex_unlock ( & mdsc - > mutex ) ;
2021-07-06 14:52:41 +01:00
delay = ceph_check_delayed_caps ( mdsc ) ;
2019-01-31 16:55:51 +08:00
ceph_queue_cap_reclaim_work ( mdsc ) ;
ceph_trim_snapid_map ( mdsc ) ;
2019-07-25 20:16:47 +08:00
maybe_recover_session ( mdsc ) ;
2021-07-06 14:52:41 +01:00
schedule_delayed ( mdsc , delay ) ;
2009-10-06 11:31:09 -07:00
}
int ceph_mdsc_init(struct ceph_fs_client *fsc)
{
	struct ceph_mds_client *mdsc;
	int err;

	mdsc = kzalloc(sizeof(struct ceph_mds_client), GFP_NOFS);
	if (!mdsc)
		return -ENOMEM;
	mdsc->fsc = fsc;
	mutex_init(&mdsc->mutex);
	mdsc->mdsmap = kzalloc(sizeof(*mdsc->mdsmap), GFP_NOFS);
	if (!mdsc->mdsmap) {
		err = -ENOMEM;
		goto err_mdsc;
	}

	init_completion(&mdsc->safe_umount_waiters);
	spin_lock_init(&mdsc->stopping_lock);
	atomic_set(&mdsc->stopping_blockers, 0);
	init_completion(&mdsc->stopping_waiter);
	init_waitqueue_head(&mdsc->session_close_wq);
	INIT_LIST_HEAD(&mdsc->waiting_for_map);
	mdsc->quotarealms_inodes = RB_ROOT;
	mutex_init(&mdsc->quotarealms_inodes_mutex);
	init_rwsem(&mdsc->snap_rwsem);
	mdsc->snap_realms = RB_ROOT;
	INIT_LIST_HEAD(&mdsc->snap_empty);
	spin_lock_init(&mdsc->snap_empty_lock);
	mdsc->request_tree = RB_ROOT;
	INIT_DELAYED_WORK(&mdsc->delayed_work, delayed_work);
	mdsc->last_renew_caps = jiffies;
	INIT_LIST_HEAD(&mdsc->cap_delay_list);
	INIT_LIST_HEAD(&mdsc->cap_wait_list);
	spin_lock_init(&mdsc->cap_delay_lock);
	INIT_LIST_HEAD(&mdsc->snap_flush_list);
	spin_lock_init(&mdsc->snap_flush_lock);
	mdsc->last_cap_flush_tid = 1;
	INIT_LIST_HEAD(&mdsc->cap_flush_list);
	INIT_LIST_HEAD(&mdsc->cap_dirty_migrating);
	spin_lock_init(&mdsc->cap_dirty_lock);
	init_waitqueue_head(&mdsc->cap_flushing_wq);
	INIT_WORK(&mdsc->cap_reclaim_work, ceph_cap_reclaim_work);
	err = ceph_metric_init(&mdsc->metric);
	if (err)
		goto err_mdsmap;

	spin_lock_init(&mdsc->dentry_list_lock);
	INIT_LIST_HEAD(&mdsc->dentry_leases);
	INIT_LIST_HEAD(&mdsc->dentry_dir_leases);

	ceph_caps_init(mdsc);
	ceph_adjust_caps_max_min(mdsc, fsc->mount_options);

	spin_lock_init(&mdsc->snapid_map_lock);
	mdsc->snapid_map_tree = RB_ROOT;
	INIT_LIST_HEAD(&mdsc->snapid_map_lru);

	init_rwsem(&mdsc->pool_perm_rwsem);
	mdsc->pool_perm_tree = RB_ROOT;

	strscpy(mdsc->nodename, utsname()->nodename,
		sizeof(mdsc->nodename));

	fsc->mdsc = mdsc;
	return 0;

err_mdsmap:
	kfree(mdsc->mdsmap);
err_mdsc:
	kfree(mdsc);
	return err;
}

/*
 * Wait for safe replies on open mds requests.  If we time out, drop
 * all requests from the tree to avoid dangling dentry refs.
 */
static void wait_requests(struct ceph_mds_client *mdsc)
{
	struct ceph_client *cl = mdsc->fsc->client;
	struct ceph_options *opts = mdsc->fsc->client->options;
	struct ceph_mds_request *req;

	mutex_lock(&mdsc->mutex);
	if (__get_oldest_req(mdsc)) {
		mutex_unlock(&mdsc->mutex);

		doutc(cl, "waiting for requests\n");
		wait_for_completion_timeout(&mdsc->safe_umount_waiters,
				    ceph_timeout_jiffies(opts->mount_timeout));

		/* tear down remaining requests */
		mutex_lock(&mdsc->mutex);
		while ((req = __get_oldest_req(mdsc))) {
			doutc(cl, "timed out on tid %llu\n", req->r_tid);
			list_del_init(&req->r_wait);
			__unregister_request(mdsc, req);
		}
	}
	mutex_unlock(&mdsc->mutex);
	doutc(cl, "done\n");
}

void send_flush_mdlog(struct ceph_mds_session *s)
{
	struct ceph_client *cl = s->s_mdsc->fsc->client;
	struct ceph_msg *msg;

	/*
	 * Pre-luminous MDS crashes when it sees an unknown session request
	 */
	if (!CEPH_HAVE_FEATURE(s->s_con.peer_features, SERVER_LUMINOUS))
		return;

	mutex_lock(&s->s_mutex);
	doutc(cl, "request mdlog flush to mds%d (%s) seq %lld\n",
	      s->s_mds, ceph_session_state_name(s->s_state), s->s_seq);
	msg = ceph_create_session_msg(CEPH_SESSION_REQUEST_FLUSH_MDLOG,
				      s->s_seq);
	if (!msg) {
		pr_err_client(cl, "failed to request mdlog flush to mds%d (%s) seq %lld\n",
			      s->s_mds, ceph_session_state_name(s->s_state), s->s_seq);
	} else {
		ceph_con_send(&s->s_con, msg);
	}
	mutex_unlock(&s->s_mutex);
}

/*
 * called before mount is ro, and before dentries are torn down.
 * (hmm, does this still race with new lookups?)
 */
void ceph_mdsc_pre_umount(struct ceph_mds_client *mdsc)
{
	doutc(mdsc->fsc->client, "begin\n");
	mdsc->stopping = CEPH_MDSC_STOPPING_BEGIN;

	ceph_mdsc_iterate_sessions(mdsc, send_flush_mdlog, true);
	ceph_mdsc_iterate_sessions(mdsc, lock_unlock_session, false);
	ceph_flush_dirty_caps(mdsc);
	wait_requests(mdsc);

	/*
	 * wait for reply handlers to drop their request refs and
	 * their inode/dcache refs
	 */
	ceph_msgr_flush();

	ceph_cleanup_quotarealms_inodes(mdsc);
	doutc(mdsc->fsc->client, "done\n");
}

/*
 * flush the mdlog and wait for all write mds requests to flush.
 */
static void flush_mdlog_and_wait_mdsc_unsafe_requests(struct ceph_mds_client *mdsc,
						 u64 want_tid)
{
	struct ceph_client *cl = mdsc->fsc->client;
	struct ceph_mds_request *req = NULL, *nextreq;
	struct ceph_mds_session *last_session = NULL;
	struct rb_node *n;

	mutex_lock(&mdsc->mutex);
	doutc(cl, "want %lld\n", want_tid);
restart:
	req = __get_oldest_req(mdsc);
	while (req && req->r_tid <= want_tid) {
		/* find next request */
		n = rb_next(&req->r_node);
		if (n)
			nextreq = rb_entry(n, struct ceph_mds_request, r_node);
		else
			nextreq = NULL;
		if (req->r_op != CEPH_MDS_OP_SETFILELOCK &&
		    (req->r_op & CEPH_MDS_OP_WRITE)) {
			struct ceph_mds_session *s = req->r_session;

			if (!s) {
				req = nextreq;
				continue;
			}

			/* write op */
			ceph_mdsc_get_request(req);
			if (nextreq)
				ceph_mdsc_get_request(nextreq);
			s = ceph_get_mds_session(s);
			mutex_unlock(&mdsc->mutex);

			/* send flush mdlog request to MDS */
			if (last_session != s) {
				send_flush_mdlog(s);
				ceph_put_mds_session(last_session);
				last_session = s;
			} else {
				ceph_put_mds_session(s);
			}
			doutc(cl, "wait on %llu (want %llu)\n",
			      req->r_tid, want_tid);
			wait_for_completion(&req->r_safe_completion);

			mutex_lock(&mdsc->mutex);
			ceph_mdsc_put_request(req);
			if (!nextreq)
				break;  /* next dne before, so we're done! */
			if (RB_EMPTY_NODE(&nextreq->r_node)) {
				/* next request was removed from tree */
				ceph_mdsc_put_request(nextreq);
				goto restart;
			}
			ceph_mdsc_put_request(nextreq);  /* won't go away */
		}
		req = nextreq;
	}
	mutex_unlock(&mdsc->mutex);
	ceph_put_mds_session(last_session);
	doutc(cl, "done\n");
}

void ceph_mdsc_sync(struct ceph_mds_client *mdsc)
{
	struct ceph_client *cl = mdsc->fsc->client;
	u64 want_tid, want_flush;

	if (READ_ONCE(mdsc->fsc->mount_state) >= CEPH_MOUNT_SHUTDOWN)
		return;

	doutc(cl, "sync\n");
	mutex_lock(&mdsc->mutex);
	want_tid = mdsc->last_tid;
	mutex_unlock(&mdsc->mutex);

	ceph_flush_dirty_caps(mdsc);
	spin_lock(&mdsc->cap_dirty_lock);
	want_flush = mdsc->last_cap_flush_tid;
	if (!list_empty(&mdsc->cap_flush_list)) {
		struct ceph_cap_flush *cf =
			list_last_entry(&mdsc->cap_flush_list,
					struct ceph_cap_flush, g_list);
		cf->wake = true;
	}
	spin_unlock(&mdsc->cap_dirty_lock);

	doutc(cl, "sync want tid %lld flush_seq %lld\n", want_tid, want_flush);

	flush_mdlog_and_wait_mdsc_unsafe_requests(mdsc, want_tid);
	wait_caps_flush(mdsc, want_flush);
}

/*
 * true if all sessions are closed, or we force unmount
 */
static bool done_closing_sessions(struct ceph_mds_client *mdsc, int skipped)
{
	if (READ_ONCE(mdsc->fsc->mount_state) == CEPH_MOUNT_SHUTDOWN)
		return true;
	return atomic_read(&mdsc->num_sessions) <= skipped;
}

/*
 * called after sb is ro or when metadata corrupted.
 */
void ceph_mdsc_close_sessions(struct ceph_mds_client *mdsc)
{
	struct ceph_options *opts = mdsc->fsc->client->options;
	struct ceph_client *cl = mdsc->fsc->client;
	struct ceph_mds_session *session;
	int i;
	int skipped = 0;

	doutc(cl, "begin\n");

	/* close sessions */
	mutex_lock(&mdsc->mutex);
	for (i = 0; i < mdsc->max_sessions; i++) {
		session = __ceph_lookup_mds_session(mdsc, i);
		if (!session)
			continue;
		mutex_unlock(&mdsc->mutex);
		mutex_lock(&session->s_mutex);
		if (__close_session(mdsc, session) <= 0)
			skipped++;
		mutex_unlock(&session->s_mutex);
		ceph_put_mds_session(session);
		mutex_lock(&mdsc->mutex);
	}
	mutex_unlock(&mdsc->mutex);

	doutc(cl, "waiting for sessions to close\n");
	wait_event_timeout(mdsc->session_close_wq,
			   done_closing_sessions(mdsc, skipped),
			   ceph_timeout_jiffies(opts->mount_timeout));

	/* tear down remaining sessions */
	mutex_lock(&mdsc->mutex);
	for (i = 0; i < mdsc->max_sessions; i++) {
		if (mdsc->sessions[i]) {
			session = ceph_get_mds_session(mdsc->sessions[i]);
			__unregister_session(mdsc, session);
			mutex_unlock(&mdsc->mutex);
			mutex_lock(&session->s_mutex);
			remove_session_caps(session);
			mutex_unlock(&session->s_mutex);
			ceph_put_mds_session(session);
			mutex_lock(&mdsc->mutex);
		}
	}
	WARN_ON(!list_empty(&mdsc->cap_delay_list));
	mutex_unlock(&mdsc->mutex);

	ceph_cleanup_snapid_map(mdsc);
	ceph_cleanup_global_and_empty_realms(mdsc);

	cancel_work_sync(&mdsc->cap_reclaim_work);
	cancel_delayed_work_sync(&mdsc->delayed_work); /* cancel timer */

	doutc(cl, "done\n");
}

void ceph_mdsc_force_umount(struct ceph_mds_client *mdsc)
{
	struct ceph_mds_session *session;
	int mds;

	doutc(mdsc->fsc->client, "force umount\n");

	mutex_lock(&mdsc->mutex);
	for (mds = 0; mds < mdsc->max_sessions; mds++) {
		session = __ceph_lookup_mds_session(mdsc, mds);
		if (!session)
			continue;

		if (session->s_state == CEPH_MDS_SESSION_REJECTED)
			__unregister_session(mdsc, session);
		__wake_requests(mdsc, &session->s_waiting);
		mutex_unlock(&mdsc->mutex);

		mutex_lock(&session->s_mutex);
		__close_session(mdsc, session);
		if (session->s_state == CEPH_MDS_SESSION_CLOSING) {
			cleanup_session_requests(mdsc, session);
			remove_session_caps(session);
		}
		mutex_unlock(&session->s_mutex);
		ceph_put_mds_session(session);

		mutex_lock(&mdsc->mutex);
		kick_requests(mdsc, mds);
	}
	__wake_requests(mdsc, &mdsc->waiting_for_map);
	mutex_unlock(&mdsc->mutex);
}

static void ceph_mdsc_stop(struct ceph_mds_client *mdsc)
{
	doutc(mdsc->fsc->client, "stop\n");
	/*
	 * Make sure the delayed work has stopped before releasing
	 * the resources.
	 *
	 * cancel_delayed_work_sync() only guarantees that the work
	 * finishes executing, but the delayed work will re-arm
	 * itself again after that.
	 */
	flush_delayed_work(&mdsc->delayed_work);

	if (mdsc->mdsmap)
		ceph_mdsmap_destroy(mdsc->mdsmap);
	kfree(mdsc->sessions);
	ceph_caps_finalize(mdsc);
	ceph_pool_perm_destroy(mdsc);
}

void ceph_mdsc_destroy(struct ceph_fs_client *fsc)
{
	struct ceph_mds_client *mdsc = fsc->mdsc;

	doutc(fsc->client, "%p\n", mdsc);

	if (!mdsc)
		return;

	/* flush out any connection work with references to us */
	ceph_msgr_flush();

	ceph_mdsc_stop(mdsc);

	ceph_metric_destroy(&mdsc->metric);

	fsc->mdsc = NULL;
	kfree(mdsc);
	doutc(fsc->client, "%p done\n", mdsc);
}

void ceph_mdsc_handle_fsmap(struct ceph_mds_client *mdsc, struct ceph_msg *msg)
{
	struct ceph_fs_client *fsc = mdsc->fsc;
	struct ceph_client *cl = fsc->client;
	const char *mds_namespace = fsc->mount_options->mds_namespace;
	void *p = msg->front.iov_base;
	void *end = p + msg->front.iov_len;
	u32 epoch;
	u32 num_fs;
	u32 mount_fscid = (u32)-1;
	int err = -EINVAL;

	ceph_decode_need(&p, end, sizeof(u32), bad);
	epoch = ceph_decode_32(&p);

	doutc(cl, "epoch %u\n", epoch);

	/* struct_v, struct_cv, map_len, epoch, legacy_client_fscid */
	ceph_decode_skip_n(&p, end, 2 + sizeof(u32) * 3, bad);

	ceph_decode_32_safe(&p, end, num_fs, bad);
	while (num_fs-- > 0) {
		void *info_p, *info_end;
		u32 info_len;
		u32 fscid, namelen;

		ceph_decode_need(&p, end, 2 + sizeof(u32), bad);
		p += 2;		// info_v, info_cv
		info_len = ceph_decode_32(&p);
		ceph_decode_need(&p, end, info_len, bad);
		info_p = p;
		info_end = p + info_len;
		p = info_end;

		ceph_decode_need(&info_p, info_end, sizeof(u32) * 2, bad);
		fscid = ceph_decode_32(&info_p);
		namelen = ceph_decode_32(&info_p);
		ceph_decode_need(&info_p, info_end, namelen, bad);

		if (mds_namespace &&
		    strlen(mds_namespace) == namelen &&
		    !strncmp(mds_namespace, (char *)info_p, namelen)) {
			mount_fscid = fscid;
			break;
		}
	}

	ceph_monc_got_map(&fsc->client->monc, CEPH_SUB_FSMAP, epoch);
	if (mount_fscid != (u32)-1) {
		fsc->client->monc.fs_cluster_id = mount_fscid;
		ceph_monc_want_map(&fsc->client->monc, CEPH_SUB_MDSMAP,
				   0, true);
		ceph_monc_renew_subs(&fsc->client->monc);
	} else {
		err = -ENOENT;
		goto err_out;
	}
	return;

bad:
	pr_err_client(cl, "error decoding fsmap %d. Shutting down mount.\n",
		      err);
	ceph_umount_begin(mdsc->fsc->sb);
	ceph_msg_dump(msg);
err_out:
	mutex_lock(&mdsc->mutex);
	mdsc->mdsmap_err = err;
	__wake_requests(mdsc, &mdsc->waiting_for_map);
	mutex_unlock(&mdsc->mutex);
}

/*
 * handle mds map update.
 */
void ceph_mdsc_handle_mdsmap(struct ceph_mds_client *mdsc, struct ceph_msg *msg)
{
	struct ceph_client *cl = mdsc->fsc->client;
	u32 epoch;
	u32 maplen;
	void *p = msg->front.iov_base;
	void *end = p + msg->front.iov_len;
	struct ceph_mdsmap *newmap, *oldmap;
	struct ceph_fsid fsid;
	int err = -EINVAL;

	ceph_decode_need(&p, end, sizeof(fsid) + 2 * sizeof(u32), bad);
	ceph_decode_copy(&p, &fsid, sizeof(fsid));
	if (ceph_check_fsid(mdsc->fsc->client, &fsid) < 0)
		return;
	epoch = ceph_decode_32(&p);
	maplen = ceph_decode_32(&p);
	doutc(cl, "epoch %u len %d\n", epoch, (int)maplen);

	/* do we need it? */
	mutex_lock(&mdsc->mutex);
	if (mdsc->mdsmap && epoch <= mdsc->mdsmap->m_epoch) {
		doutc(cl, "epoch %u <= our %u\n", epoch, mdsc->mdsmap->m_epoch);
		mutex_unlock(&mdsc->mutex);
		return;
	}

	newmap = ceph_mdsmap_decode(mdsc, &p, end, ceph_msgr2(mdsc->fsc->client));
	if (IS_ERR(newmap)) {
		err = PTR_ERR(newmap);
		goto bad_unlock;
	}

	/* swap into place */
	if (mdsc->mdsmap) {
		oldmap = mdsc->mdsmap;
		mdsc->mdsmap = newmap;
		check_new_map(mdsc, newmap, oldmap);
		ceph_mdsmap_destroy(oldmap);
	} else {
		mdsc->mdsmap = newmap;  /* first mds map */
	}
	mdsc->fsc->max_file_size = min((loff_t)mdsc->mdsmap->m_max_file_size,
				       MAX_LFS_FILESIZE);

	__wake_requests(mdsc, &mdsc->waiting_for_map);
	ceph_monc_got_map(&mdsc->fsc->client->monc, CEPH_SUB_MDSMAP,
			  mdsc->mdsmap->m_epoch);

	mutex_unlock(&mdsc->mutex);
	schedule_delayed(mdsc, 0);
	return;

bad_unlock:
	mutex_unlock(&mdsc->mutex);
bad:
	pr_err_client(cl, "error decoding mdsmap %d. Shutting down mount.\n",
		      err);
	ceph_umount_begin(mdsc->fsc->sb);
	ceph_msg_dump(msg);
	return;
}

static struct ceph_connection *mds_get_con(struct ceph_connection *con)
{
	struct ceph_mds_session *s = con->private;

	if (ceph_get_mds_session(s))
		return con;
	return NULL;
}

static void mds_put_con(struct ceph_connection *con)
{
	struct ceph_mds_session *s = con->private;

	ceph_put_mds_session(s);
}

/*
 * if the client is unresponsive for long enough, the mds will kill
 * the session entirely.
 */
static void mds_peer_reset(struct ceph_connection *con)
{
	struct ceph_mds_session *s = con->private;
	struct ceph_mds_client *mdsc = s->s_mdsc;

	pr_warn_client(mdsc->fsc->client, "mds%d closed our session\n",
		       s->s_mds);
	if (READ_ONCE(mdsc->fsc->mount_state) != CEPH_MOUNT_FENCE_IO)
		send_mds_reconnect(mdsc, s);
}

static void mds_dispatch(struct ceph_connection *con, struct ceph_msg *msg)
{
	struct ceph_mds_session *s = con->private;
	struct ceph_mds_client *mdsc = s->s_mdsc;
	struct ceph_client *cl = mdsc->fsc->client;
	int type = le16_to_cpu(msg->hdr.type);

	mutex_lock(&mdsc->mutex);
	if (__verify_registered_session(mdsc, s) < 0) {
		mutex_unlock(&mdsc->mutex);
		goto out;
	}
	mutex_unlock(&mdsc->mutex);

	switch (type) {
	case CEPH_MSG_MDS_MAP:
		ceph_mdsc_handle_mdsmap(mdsc, msg);
		break;
	case CEPH_MSG_FS_MAP_USER:
		ceph_mdsc_handle_fsmap(mdsc, msg);
		break;
	case CEPH_MSG_CLIENT_SESSION:
		handle_session(s, msg);
		break;
	case CEPH_MSG_CLIENT_REPLY:
		handle_reply(s, msg);
		break;
	case CEPH_MSG_CLIENT_REQUEST_FORWARD:
		handle_forward(mdsc, s, msg);
		break;
	case CEPH_MSG_CLIENT_CAPS:
		ceph_handle_caps(s, msg);
		break;
	case CEPH_MSG_CLIENT_SNAP:
		ceph_handle_snap(mdsc, s, msg);
		break;
	case CEPH_MSG_CLIENT_LEASE:
		handle_lease(mdsc, s, msg);
		break;
	case CEPH_MSG_CLIENT_QUOTA:
		ceph_handle_quota(mdsc, s, msg);
		break;

	default:
		pr_err_client(cl, "received unknown message type %d %s\n",
			      type, ceph_msg_type_name(type));
	}
out:
	ceph_msg_put(msg);
}

/*
 * authentication
 */

/*
 * Note: returned pointer is the address of a structure that's
 * managed separately.  Caller must *not* attempt to free it.
 */
static struct ceph_auth_handshake *
mds_get_authorizer(struct ceph_connection *con, int *proto, int force_new)
{
	struct ceph_mds_session *s = con->private;
	struct ceph_mds_client *mdsc = s->s_mdsc;
	struct ceph_auth_client *ac = mdsc->fsc->client->monc.auth;
	struct ceph_auth_handshake *auth = &s->s_auth;
	int ret;

	ret = __ceph_auth_get_authorizer(ac, auth, CEPH_ENTITY_TYPE_MDS,
					 force_new, proto, NULL, NULL);
	if (ret)
		return ERR_PTR(ret);

	return auth;
}

static int mds_add_authorizer_challenge(struct ceph_connection *con,
				    void *challenge_buf, int challenge_buf_len)
{
	struct ceph_mds_session *s = con->private;
	struct ceph_mds_client *mdsc = s->s_mdsc;
	struct ceph_auth_client *ac = mdsc->fsc->client->monc.auth;

	return ceph_auth_add_authorizer_challenge(ac, s->s_auth.authorizer,
					    challenge_buf, challenge_buf_len);
}

static int mds_verify_authorizer_reply(struct ceph_connection *con)
{
	struct ceph_mds_session *s = con->private;
	struct ceph_mds_client *mdsc = s->s_mdsc;
	struct ceph_auth_client *ac = mdsc->fsc->client->monc.auth;
	struct ceph_auth_handshake *auth = &s->s_auth;

	return ceph_auth_verify_authorizer_reply(ac, auth->authorizer,
		auth->authorizer_reply_buf, auth->authorizer_reply_buf_len,
		NULL, NULL, NULL, NULL);
}

static int mds_invalidate_authorizer(struct ceph_connection *con)
{
	struct ceph_mds_session *s = con->private;
	struct ceph_mds_client *mdsc = s->s_mdsc;
	struct ceph_auth_client *ac = mdsc->fsc->client->monc.auth;

	ceph_auth_invalidate_authorizer(ac, CEPH_ENTITY_TYPE_MDS);

	return ceph_monc_validate_auth(&mdsc->fsc->client->monc);
}

static int mds_get_auth_request(struct ceph_connection *con,
				void *buf, int *buf_len,
				void **authorizer, int *authorizer_len)
{
	struct ceph_mds_session *s = con->private;
	struct ceph_auth_client *ac = s->s_mdsc->fsc->client->monc.auth;
	struct ceph_auth_handshake *auth = &s->s_auth;
	int ret;

	ret = ceph_auth_get_authorizer(ac, auth, CEPH_ENTITY_TYPE_MDS,
				       buf, buf_len);
	if (ret)
		return ret;

	*authorizer = auth->authorizer_buf;
	*authorizer_len = auth->authorizer_buf_len;
	return 0;
}

static int mds_handle_auth_reply_more(struct ceph_connection *con,
				      void *reply, int reply_len,
				      void *buf, int *buf_len,
				      void **authorizer, int *authorizer_len)
{
	struct ceph_mds_session *s = con->private;
	struct ceph_auth_client *ac = s->s_mdsc->fsc->client->monc.auth;
	struct ceph_auth_handshake *auth = &s->s_auth;
	int ret;

	ret = ceph_auth_handle_svc_reply_more(ac, auth, reply, reply_len,
					      buf, buf_len);
	if (ret)
		return ret;

	*authorizer = auth->authorizer_buf;
	*authorizer_len = auth->authorizer_buf_len;
	return 0;
}

static int mds_handle_auth_done(struct ceph_connection *con,
				u64 global_id, void *reply, int reply_len,
				u8 *session_key, int *session_key_len,
				u8 *con_secret, int *con_secret_len)
{
	struct ceph_mds_session *s = con->private;
	struct ceph_auth_client *ac = s->s_mdsc->fsc->client->monc.auth;
	struct ceph_auth_handshake *auth = &s->s_auth;

	return ceph_auth_handle_svc_reply_done(ac, auth, reply, reply_len,
					       session_key, session_key_len,
					       con_secret, con_secret_len);
}

static int mds_handle_auth_bad_method(struct ceph_connection *con,
				      int used_proto, int result,
				      const int *allowed_protos, int proto_cnt,
				      const int *allowed_modes, int mode_cnt)
{
	struct ceph_mds_session *s = con->private;
	struct ceph_mon_client *monc = &s->s_mdsc->fsc->client->monc;
	int ret;

	if (ceph_auth_handle_bad_authorizer(monc->auth, CEPH_ENTITY_TYPE_MDS,
					    used_proto, result,
					    allowed_protos, proto_cnt,
					    allowed_modes, mode_cnt)) {
		ret = ceph_monc_validate_auth(monc);
		if (ret)
			return ret;
	}

	return -EACCES;
}

static struct ceph_msg *mds_alloc_msg(struct ceph_connection *con,
				struct ceph_msg_header *hdr, int *skip)
{
	struct ceph_msg *msg;
	int type = (int)le16_to_cpu(hdr->type);
	int front_len = (int)le32_to_cpu(hdr->front_len);

	if (con->in_msg)
		return con->in_msg;

	*skip = 0;
	msg = ceph_msg_new(type, front_len, GFP_NOFS, false);
	if (!msg) {
		pr_err("unable to allocate msg type %d len %d\n",
		       type, front_len);
		return NULL;
	}

	return msg;
}

static int mds_sign_message(struct ceph_msg *msg)
{
	struct ceph_mds_session *s = msg->con->private;
	struct ceph_auth_handshake *auth = &s->s_auth;

	return ceph_auth_sign_message(auth, msg);
}

static int mds_check_message_signature(struct ceph_msg *msg)
{
	struct ceph_mds_session *s = msg->con->private;
	struct ceph_auth_handshake *auth = &s->s_auth;

	return ceph_auth_check_message_signature(auth, msg);
}

static const struct ceph_connection_operations mds_con_ops = {
	.get = mds_get_con,
	.put = mds_put_con,
	.alloc_msg = mds_alloc_msg,
	.dispatch = mds_dispatch,
	.peer_reset = mds_peer_reset,
	.get_authorizer = mds_get_authorizer,
	.add_authorizer_challenge = mds_add_authorizer_challenge,
	.verify_authorizer_reply = mds_verify_authorizer_reply,
	.invalidate_authorizer = mds_invalidate_authorizer,
	.sign_message = mds_sign_message,
	.check_message_signature = mds_check_message_signature,
	.get_auth_request = mds_get_auth_request,
	.handle_auth_reply_more = mds_handle_auth_reply_more,
	.handle_auth_done = mds_handle_auth_done,
	.handle_auth_bad_method = mds_handle_auth_bad_method,
};

/* eof */