// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
/*
 * Copyright (c) 2014-2017 Oracle.  All rights reserved.
 * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
 * licenses.  You may choose to be licensed under the terms of the GNU
 * General Public License (GPL) Version 2, available from the file
 * COPYING in the main directory of this source tree, or the BSD-type
 * license below:
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 *      Redistributions of source code must retain the above copyright
 *      notice, this list of conditions and the following disclaimer.
 *
 *      Redistributions in binary form must reproduce the above
 *      copyright notice, this list of conditions and the following
 *      disclaimer in the documentation and/or other materials provided
 *      with the distribution.
 *
 *      Neither the name of the Network Appliance, Inc. nor the names of
 *      its contributors may be used to endorse or promote products
 *      derived from this software without specific prior written
 *      permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
 * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
 * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
 * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
 * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */

/*
 * transport.c
 *
 * This file contains the top-level implementation of an RPC RDMA
 * transport.
 *
 * Naming convention: functions beginning with xprt_ are part of the
 * transport switch. All others are RPC RDMA internal.
 */

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/seq_file.h>
#include <linux/smp.h>

#include <linux/sunrpc/addr.h>
#include <linux/sunrpc/svc_rdma.h>

#include "xprt_rdma.h"
#include <trace/events/rpcrdma.h>

#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
# define RPCDBG_FACILITY	RPCDBG_TRANS
#endif

/*
 * tunables
 */

unsigned int xprt_rdma_slot_table_entries = RPCRDMA_DEF_SLOT_TABLE;
unsigned int xprt_rdma_max_inline_read = RPCRDMA_DEF_INLINE;
unsigned int xprt_rdma_max_inline_write = RPCRDMA_DEF_INLINE;
unsigned int xprt_rdma_memreg_strategy = RPCRDMA_FRWR;
int xprt_rdma_pad_optimize;

#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)

static unsigned int min_slot_table_size = RPCRDMA_MIN_SLOT_TABLE;
static unsigned int max_slot_table_size = RPCRDMA_MAX_SLOT_TABLE;
static unsigned int min_inline_size = RPCRDMA_MIN_INLINE;
static unsigned int max_inline_size = RPCRDMA_MAX_INLINE;
static unsigned int max_padding = PAGE_SIZE;
static unsigned int min_memreg = RPCRDMA_BOUNCEBUFFERS;
static unsigned int max_memreg = RPCRDMA_LAST - 1;
static unsigned int dummy;

static struct ctl_table_header *sunrpc_table_header;

static struct ctl_table xr_tunables_table[] = {
	{
		.procname	= "rdma_slot_table_entries",
		.data		= &xprt_rdma_slot_table_entries,
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &min_slot_table_size,
		.extra2		= &max_slot_table_size
	},
	{
		.procname	= "rdma_max_inline_read",
		.data		= &xprt_rdma_max_inline_read,
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &min_inline_size,
		.extra2		= &max_inline_size,
	},
	{
		.procname	= "rdma_max_inline_write",
		.data		= &xprt_rdma_max_inline_write,
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &min_inline_size,
		.extra2		= &max_inline_size,
	},
	{
		.procname	= "rdma_inline_write_padding",
		.data		= &dummy,
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ZERO,
		.extra2		= &max_padding,
	},
	{
		.procname	= "rdma_memreg_strategy",
		.data		= &xprt_rdma_memreg_strategy,
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &min_memreg,
		.extra2		= &max_memreg,
	},
	{
		.procname	= "rdma_pad_optimize",
		.data		= &xprt_rdma_pad_optimize,
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec,
	},
	{ },
};

static struct ctl_table sunrpc_table[] = {
	{
		.procname	= "sunrpc",
		.mode		= 0555,
		.child		= xr_tunables_table
	},
	{ },
};

#endif
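
/*
 * Illustrative sketch only, not part of the kernel source. Most tunables
 * above pair proc_dointvec_minmax with .extra1/.extra2 bounds. The userspace
 * stand-in below shows the general idea of such a range check: a new value
 * is stored only if it lies inside [min, max]. The names tunable_range and
 * tunable_store are hypothetical, and the real handler's parsing and error
 * codes are not modelled here.
 */

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the range-related fields of one table entry. */
struct tunable_range {
	unsigned int *data;		/* value being tuned (like .data) */
	const unsigned int *min;	/* lower bound (like .extra1), may be NULL */
	const unsigned int *max;	/* upper bound (like .extra2), may be NULL */
};

/* Store val only if it lies within the configured bounds. */
static bool tunable_store(const struct tunable_range *t, unsigned int val)
{
	if (t->min != NULL && val < *t->min)
		return false;
	if (t->max != NULL && val > *t->max)
		return false;
	*t->data = val;
	return true;
}
```

/*
 * An out-of-range write leaves the previous value untouched, which is the
 * property the min/max pairs in the table are meant to guarantee.
 */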

static const struct rpc_xprt_ops xprt_rdma_procs;

static void
xprt_rdma_format_addresses4(struct rpc_xprt *xprt, struct sockaddr *sap)
{
	struct sockaddr_in *sin = (struct sockaddr_in *)sap;
	char buf[20];

	snprintf(buf, sizeof(buf), "%08x", ntohl(sin->sin_addr.s_addr));
	xprt->address_strings[RPC_DISPLAY_HEX_ADDR] = kstrdup(buf, GFP_KERNEL);

	xprt->address_strings[RPC_DISPLAY_NETID] = RPCBIND_NETID_RDMA;
}

static void
xprt_rdma_format_addresses6(struct rpc_xprt *xprt, struct sockaddr *sap)
{
	struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)sap;
	char buf[40];

	snprintf(buf, sizeof(buf), "%pi6", &sin6->sin6_addr);
	xprt->address_strings[RPC_DISPLAY_HEX_ADDR] = kstrdup(buf, GFP_KERNEL);

	xprt->address_strings[RPC_DISPLAY_NETID] = RPCBIND_NETID_RDMA6;
}

void
xprt_rdma_format_addresses(struct rpc_xprt *xprt, struct sockaddr *sap)
{
	char buf[128];

	switch (sap->sa_family) {
	case AF_INET:
		xprt_rdma_format_addresses4(xprt, sap);
		break;
	case AF_INET6:
		xprt_rdma_format_addresses6(xprt, sap);
		break;
	default:
		pr_err("rpcrdma: Unrecognized address family\n");
		return;
	}

	(void)rpc_ntop(sap, buf, sizeof(buf));
	xprt->address_strings[RPC_DISPLAY_ADDR] = kstrdup(buf, GFP_KERNEL);

	snprintf(buf, sizeof(buf), "%u", rpc_get_port(sap));
	xprt->address_strings[RPC_DISPLAY_PORT] = kstrdup(buf, GFP_KERNEL);

	snprintf(buf, sizeof(buf), "%4hx", rpc_get_port(sap));
	xprt->address_strings[RPC_DISPLAY_HEX_PORT] = kstrdup(buf, GFP_KERNEL);

	xprt->address_strings[RPC_DISPLAY_PROTO] = "rdma";
}
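
/*
 * Illustrative sketch only, not part of the kernel source. The display
 * strings above are built with "%08x" for the IPv4 address and "%4hx" for
 * the port. The userspace helpers below (format_hex_addr and
 * format_hex_port are hypothetical names) apply the same two format
 * strings, so the padding behaviour can be checked outside the kernel.
 * The address is assumed to be in host byte order already, as it is after
 * ntohl() in the function above.
 */

```c
#include <stdint.h>
#include <stdio.h>

/* Zero-padded, eight-digit lowercase hex, as "%08x" produces. */
static void format_hex_addr(uint32_t host_addr, char *buf, size_t len)
{
	snprintf(buf, len, "%08x", (unsigned int)host_addr);
}

/* Space-padded to a minimum width of four, as "%4hx" produces. */
static void format_hex_port(uint16_t port, char *buf, size_t len)
{
	snprintf(buf, len, "%4hx", (unsigned short)port);
}
```

/*
 * Note the difference in padding: the address is zero-padded, while a port
 * below 0x1000 is padded with a leading space rather than a zero.
 */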

void
xprt_rdma_free_addresses(struct rpc_xprt *xprt)
{
	unsigned int i;

	for (i = 0; i < RPC_DISPLAY_MAX; i++)
		switch (i) {
		case RPC_DISPLAY_PROTO:
		case RPC_DISPLAY_NETID:
			continue;
		default:
			kfree(xprt->address_strings[i]);
		}
}

/**
 * xprt_rdma_connect_worker - establish connection in the background
 * @work: worker thread context
 *
 * Requester holds the xprt's send lock to prevent activity on this
 * transport while a fresh connection is being established. RPC tasks
 * sleep on the xprt's pending queue waiting for connect to complete.
 */
static void
xprt_rdma_connect_worker(struct work_struct *work)
{
	struct rpcrdma_xprt *r_xprt = container_of(work, struct rpcrdma_xprt,
						   rx_connect_worker.work);
	struct rpc_xprt *xprt = &r_xprt->rx_xprt;
	int rc;

	rc = rpcrdma_ep_connect(&r_xprt->rx_ep, &r_xprt->rx_ia);
	xprt_clear_connecting(xprt);
	if (r_xprt->rx_ep.rep_connected > 0) {
		xprt->stat.connect_count++;
		xprt->stat.connect_time += (long)jiffies -
					   xprt->stat.connect_start;
		xprt_set_connected(xprt);
		rc = -EAGAIN;
	}
	xprt_wake_pending_tasks(xprt, rc);
}

/**
 * xprt_rdma_inject_disconnect - inject a connection fault
 * @xprt: transport context
 *
 * If @xprt is connected, disconnect it to simulate spurious connection
 * loss.
 */
static void
xprt_rdma_inject_disconnect(struct rpc_xprt *xprt)
{
	struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);

	trace_xprtrdma_op_inject_dsc(r_xprt);
	rdma_disconnect(r_xprt->rx_ia.ri_id);
}

/**
 * xprt_rdma_destroy - Full tear down of transport
 * @xprt: doomed transport context
 *
 * Caller guarantees there will be no more calls to us with
 * this @xprt.
 */
static void
xprt_rdma_destroy(struct rpc_xprt *xprt)
{
	struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);

	trace_xprtrdma_op_destroy(r_xprt);

	cancel_delayed_work_sync(&r_xprt->rx_connect_worker);

	rpcrdma_ep_destroy(r_xprt);
	rpcrdma_buffer_destroy(&r_xprt->rx_buf);
	rpcrdma_ia_close(&r_xprt->rx_ia);

	xprt_rdma_free_addresses(xprt);
	xprt_free(xprt);

	module_put(THIS_MODULE);
}

/* 60 second timeout, no retries */
static const struct rpc_timeout xprt_rdma_default_timeout = {
	.to_initval = 60 * HZ,
	.to_maxval = 60 * HZ,
};

/**
 * xprt_setup_rdma - Set up transport to use RDMA
 *
 * @args: rpc transport arguments
 */
static struct rpc_xprt *
xprt_setup_rdma(struct xprt_create *args)
{
	struct rpc_xprt *xprt;
	struct rpcrdma_xprt *new_xprt;
	struct sockaddr *sap;
	int rc;

	if (args->addrlen > sizeof(xprt->addr))
		return ERR_PTR(-EBADF);

	xprt = xprt_alloc(args->net, sizeof(struct rpcrdma_xprt), 0, 0);
	if (!xprt)
		return ERR_PTR(-ENOMEM);

	xprt->timeout = &xprt_rdma_default_timeout;
	xprt->connect_timeout = xprt->timeout->to_initval;
	xprt->max_reconnect_timeout = xprt->timeout->to_maxval;
	xprt->bind_timeout = RPCRDMA_BIND_TO;
	xprt->reestablish_timeout = RPCRDMA_INIT_REEST_TO;
	xprt->idle_timeout = RPCRDMA_IDLE_DISC_TO;

	xprt->resvport = 0;		/* privileged port not needed */
	xprt->ops = &xprt_rdma_procs;

	/*
	 * Set up RDMA-specific connect data.
	 */
	sap = args->dstaddr;

	/* Ensure xprt->addr holds valid server TCP (not RDMA)
	 * address, for any side protocols which peek at it */
	xprt->prot = IPPROTO_TCP;
	xprt->addrlen = args->addrlen;
	memcpy(&xprt->addr, sap, xprt->addrlen);

	if (rpc_get_port(sap))
		xprt_set_bound(xprt);
	xprt_rdma_format_addresses(xprt, sap);

	new_xprt = rpcx_to_rdmax(xprt);
	rc = rpcrdma_ia_open(new_xprt);
	if (rc)
		goto out1;

	rc = rpcrdma_ep_create(new_xprt);
	if (rc)
		goto out2;

	rc = rpcrdma_buffer_create(new_xprt);
	if (rc)
		goto out3;

	INIT_DELAYED_WORK(&new_xprt->rx_connect_worker,
			  xprt_rdma_connect_worker);

	xprt->max_payload = frwr_maxpages(new_xprt);
	if (xprt->max_payload == 0)
		goto out4;
	xprt->max_payload <<= PAGE_SHIFT;
	dprintk("RPC: %s: transport data payload maximum: %zu bytes\n",
		__func__, xprt->max_payload);

	if (!try_module_get(THIS_MODULE))
		goto out4;

	dprintk("RPC: %s: %s:%s\n", __func__,
		xprt->address_strings[RPC_DISPLAY_ADDR],
		xprt->address_strings[RPC_DISPLAY_PORT]);
	trace_xprtrdma_create(new_xprt);
	return xprt;

out4:
	rpcrdma_buffer_destroy(&new_xprt->rx_buf);
	rc = -ENODEV;
out3:
	rpcrdma_ep_destroy(new_xprt);
out2:
	rpcrdma_ia_close(&new_xprt->rx_ia);
out1:
	trace_xprtrdma_op_destroy(new_xprt);
	xprt_rdma_free_addresses(xprt);
	xprt_free(xprt);
	return ERR_PTR(rc);
}
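
/*
 * Illustrative sketch only, not part of the kernel source. The function
 * above acquires resources in order (ia, ep, buffers) and, on failure,
 * jumps to a label that releases only what was already acquired, in
 * reverse order. The userspace model below reproduces that goto-unwind
 * shape with three hypothetical stages; struct stages, stage_open,
 * stage_close and the fail_at knob are all invented for illustration.
 */

```c
/* Tracks which stages are currently held, one bit per stage. */
struct stages {
	unsigned int live;	/* bit n set while stage n is held */
	int fail_at;		/* stage that should fail, or -1 for none */
};

static int stage_open(struct stages *s, int n)
{
	if (s->fail_at == n)
		return -1;
	s->live |= 1u << n;
	return 0;
}

static void stage_close(struct stages *s, int n)
{
	s->live &= ~(1u << n);
}

/* Acquire stages 0, 1, 2 in order; unwind in reverse on failure. */
static int stages_setup(struct stages *s)
{
	int rc;

	rc = stage_open(s, 0);
	if (rc)
		goto out1;
	rc = stage_open(s, 1);
	if (rc)
		goto out2;
	rc = stage_open(s, 2);
	if (rc)
		goto out3;
	return 0;

out3:
	stage_close(s, 1);
out2:
	stage_close(s, 0);
out1:
	return rc;
}
```

/*
 * Because each label falls through to the next, a failure at any stage
 * leaves nothing held, and no cleanup call is duplicated.
 */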

/**
 * xprt_rdma_close - close a transport connection
 * @xprt: transport context
 *
 * Called during autoclose or device removal.
 *
 * Caller holds @xprt's send lock to prevent activity on this
 * transport while the connection is torn down.
 */
void xprt_rdma_close(struct rpc_xprt *xprt)
{
	struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
	struct rpcrdma_ep *ep = &r_xprt->rx_ep;
	struct rpcrdma_ia *ia = &r_xprt->rx_ia;

	might_sleep();

	trace_xprtrdma_op_close(r_xprt);

	/* Prevent marshaling and sending of new requests */
	xprt_clear_connected(xprt);

	if (test_and_clear_bit(RPCRDMA_IAF_REMOVING, &ia->ri_flags)) {
		rpcrdma_ia_remove(ia);
		goto out;
	}

	if (ep->rep_connected == -ENODEV)
		return;
	rpcrdma_ep_disconnect(ep, ia);

out:
	xprt->reestablish_timeout = 0;
	++xprt->connect_cookie;
	xprt_disconnect_done(xprt);
}

/**
 * xprt_rdma_set_port - update server port with rpcbind result
 * @xprt: controlling RPC transport
 * @port: new port value
 *
 * Transport connect status is unchanged.
 */
static void
xprt_rdma_set_port(struct rpc_xprt *xprt, u16 port)
{
	struct sockaddr *sap = (struct sockaddr *)&xprt->addr;
	char buf[8];

	dprintk("RPC: %s: setting port for xprt %p (%s:%s) to %u\n",
		__func__, xprt,
		xprt->address_strings[RPC_DISPLAY_ADDR],
		xprt->address_strings[RPC_DISPLAY_PORT],
		port);

	rpc_set_port(sap, port);

	kfree(xprt->address_strings[RPC_DISPLAY_PORT]);
	snprintf(buf, sizeof(buf), "%u", port);
	xprt->address_strings[RPC_DISPLAY_PORT] = kstrdup(buf, GFP_KERNEL);

	kfree(xprt->address_strings[RPC_DISPLAY_HEX_PORT]);
	snprintf(buf, sizeof(buf), "%4hx", port);
	xprt->address_strings[RPC_DISPLAY_HEX_PORT] = kstrdup(buf, GFP_KERNEL);
}

/**
 * xprt_rdma_timer - invoked when an RPC times out
 * @xprt: controlling RPC transport
 * @task: RPC task that timed out
 *
 * Invoked when the transport is still connected, but an RPC
 * retransmit timeout occurs.
 *
 * Since RDMA connections don't have a keep-alive, forcibly
 * disconnect and retry to connect. This drives full
 * detection of the network path, and retransmissions of
 * all pending RPCs.
 */
static void
xprt_rdma_timer(struct rpc_xprt *xprt, struct rpc_task *task)
{
	xprt_force_disconnect(xprt);
}

/**
 * xprt_rdma_set_connect_timeout - set timeouts for establishing a connection
 * @xprt: controlling transport instance
 * @connect_timeout: reconnect timeout after client disconnects
 * @reconnect_timeout: reconnect timeout after server disconnects
 *
 */
static void xprt_rdma_set_connect_timeout(struct rpc_xprt *xprt,
					  unsigned long connect_timeout,
					  unsigned long reconnect_timeout)
{
	struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);

	trace_xprtrdma_op_set_cto(r_xprt, connect_timeout, reconnect_timeout);

	spin_lock(&xprt->transport_lock);

	if (connect_timeout < xprt->connect_timeout) {
		struct rpc_timeout to;
		unsigned long initval;

		to = *xprt->timeout;
		initval = connect_timeout;
		if (initval < RPCRDMA_INIT_REEST_TO << 1)
			initval = RPCRDMA_INIT_REEST_TO << 1;
		to.to_initval = initval;
		to.to_maxval = initval;
		r_xprt->rx_timeout = to;
		xprt->timeout = &r_xprt->rx_timeout;
		xprt->connect_timeout = connect_timeout;
	}

	if (reconnect_timeout < xprt->max_reconnect_timeout)
		xprt->max_reconnect_timeout = reconnect_timeout;

	spin_unlock(&xprt->transport_lock);
}
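
/*
 * Illustrative sketch only, not part of the kernel source. The function
 * above only ever shortens timeouts, and it floors the RPC timeout at
 * twice RPCRDMA_INIT_REEST_TO. The userspace model below isolates that
 * arithmetic without locking or tracing. struct timeouts_sketch and
 * SKETCH_INIT_REEST_TO are hypothetical; the value 5 is an arbitrary
 * placeholder in abstract ticks, not the kernel's real constant.
 */

```c
#define SKETCH_INIT_REEST_TO	5UL	/* assumed placeholder, arbitrary ticks */

struct timeouts_sketch {
	unsigned long connect_timeout;
	unsigned long max_reconnect_timeout;
	unsigned long to_initval;
	unsigned long to_maxval;
};

/* Lower the timeouts if the new values are smaller; never raise them. */
static void sketch_set_connect_timeout(struct timeouts_sketch *t,
				       unsigned long connect_timeout,
				       unsigned long reconnect_timeout)
{
	if (connect_timeout < t->connect_timeout) {
		unsigned long initval = connect_timeout;

		/* Floor the RPC timeout at twice the initial reconnect delay. */
		if (initval < SKETCH_INIT_REEST_TO << 1)
			initval = SKETCH_INIT_REEST_TO << 1;
		t->to_initval = initval;
		t->to_maxval = initval;
		t->connect_timeout = connect_timeout;
	}

	if (reconnect_timeout < t->max_reconnect_timeout)
		t->max_reconnect_timeout = reconnect_timeout;
}
```

/*
 * Note that connect_timeout itself records the requested value, while
 * only the derived RPC timeout (to_initval/to_maxval) is floored.
 */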

/**
 * xprt_rdma_connect - schedule an attempt to reconnect
 * @xprt: transport state
 * @task: RPC scheduler context (unused)
 *
 */
static void
xprt_rdma_connect(struct rpc_xprt *xprt, struct rpc_task *task)
{
	struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
	unsigned long delay;

	trace_xprtrdma_op_connect(r_xprt);

	delay = 0;
	if (r_xprt->rx_ep.rep_connected != 0) {
		delay = xprt_reconnect_delay(xprt);
		xprt_reconnect_backoff(xprt, RPCRDMA_INIT_REEST_TO);
	}
	queue_delayed_work(xprtiod_workqueue, &r_xprt->rx_connect_worker,
			   delay);
}

/**
 * xprt_rdma_alloc_slot - allocate an rpc_rqst
 * @xprt: controlling RPC transport
 * @task: RPC task requesting a fresh rpc_rqst
 *
 * tk_status values:
 *	%0 if task->tk_rqstp points to a fresh rpc_rqst
 *	%-EAGAIN if no rpc_rqst is available; queued on backlog
 */
static void
xprt_rdma_alloc_slot(struct rpc_xprt *xprt, struct rpc_task *task)
{
	struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
	struct rpcrdma_req *req;

	req = rpcrdma_buffer_get(&r_xprt->rx_buf);
	if (!req)
		goto out_sleep;
	task->tk_rqstp = &req->rl_slot;
	task->tk_status = 0;
	return;

out_sleep:
	set_bit(XPRT_CONGESTED, &xprt->state);
	rpc_sleep_on(&xprt->backlog, task, NULL);
	task->tk_status = -EAGAIN;
}

/**
 * xprt_rdma_free_slot - release an rpc_rqst
 * @xprt: controlling RPC transport
 * @rqst: rpc_rqst to release
 *
 */
static void
xprt_rdma_free_slot(struct rpc_xprt *xprt, struct rpc_rqst *rqst)
{
	struct rpcrdma_xprt *r_xprt =
		container_of(xprt, struct rpcrdma_xprt, rx_xprt);

	memset(rqst, 0, sizeof(*rqst));
	rpcrdma_buffer_put(&r_xprt->rx_buf, rpcr_to_rdmar(rqst));
	if (unlikely(!rpc_wake_up_next(&xprt->backlog)))
		clear_bit(XPRT_CONGESTED, &xprt->state);
}
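
/*
 * Illustrative sketch only, not part of the kernel source. The two
 * functions above cooperate: an allocation that finds no free request
 * marks the transport congested and queues the task, and a release clears
 * the congested mark only when nobody is left waiting. The single-threaded
 * userspace model below captures that bookkeeping with plain counters.
 * struct slot_pool, pool_alloc and pool_free are hypothetical names, and
 * the waiter count stands in for the real backlog queue.
 */

```c
#include <errno.h>
#include <stdbool.h>

struct slot_pool {
	int free_slots;		/* requests currently available */
	int waiters;		/* tasks parked on the backlog */
	bool congested;		/* models the XPRT_CONGESTED bit */
};

/* Returns 0 on success, or -EAGAIN after queueing the caller. */
static int pool_alloc(struct slot_pool *p)
{
	if (p->free_slots > 0) {
		p->free_slots--;
		return 0;
	}
	p->congested = true;
	p->waiters++;
	return -EAGAIN;
}

/* Return a slot; wake one waiter, or clear congestion if none waits. */
static void pool_free(struct slot_pool *p)
{
	p->free_slots++;
	if (p->waiters > 0)
		p->waiters--;		/* the woken task retries pool_alloc() */
	else
		p->congested = false;
}
```

/*
 * In this model the congested flag stays set while a woken waiter is
 * still about to retry, and is dropped on the first release that finds
 * the backlog empty.
 */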

static bool rpcrdma_check_regbuf(struct rpcrdma_xprt *r_xprt,
				 struct rpcrdma_regbuf *rb, size_t size,
				 gfp_t flags)
{
2019-04-24 09:39:27 -04:00
if ( unlikely ( rdmab_length ( rb ) < size ) ) {
if ( ! rpcrdma_regbuf_realloc ( rb , size , flags ) )
return false ;
r_xprt - > rx_stats . hardway_register_count + = size ;
}
2016-09-15 10:55:53 -04:00
return true ;
}
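The cacheline hazard described in the commit message above comes down to simple address arithmetic. The following is a minimal user-space sketch, not kernel code: the helper names `cacheline_index` and `buffers_share_cacheline` are hypothetical, and the line size is an assumed, caller-supplied parameter.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: index of the cacheline that contains addr.
 * line_size is an assumption supplied by the caller and must be nonzero.
 */
static uintptr_t cacheline_index(uintptr_t addr, size_t line_size)
{
	return addr / line_size;
}

/* Hypothetical helper: true if the last byte of the buffer that
 * starts at a (a_len bytes long) lies in the same cacheline as the
 * first byte of the buffer that starts at b.
 */
static bool buffers_share_cacheline(uintptr_t a, size_t a_len,
				    uintptr_t b, size_t line_size)
{
	if (a_len == 0)
		return false;
	return cacheline_index(a + a_len - 1, line_size) ==
	       cacheline_index(b, line_size);
}
```

Assuming a 64-byte line, a 100-byte send buffer at offset 0 followed directly by a receive buffer at offset 100 would share a line, while a receive buffer that starts at offset 128 would not.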
2016-09-15 10:55:20 -04:00
/**
* xprt_rdma_allocate - allocate transport resources for an RPC
* @ task : RPC task
*
* Return values :
* 0 : Success ; rq_buffer points to RPC buffer to use
* ENOMEM : Out of memory , call again later
* EIO : A permanent error occurred , do not retry
2007-09-10 13:50:12 -04:00
*/
2016-09-15 10:55:20 -04:00
static int
xprt_rdma_allocate ( struct rpc_task * task )
2007-09-10 13:50:12 -04:00
{
2016-09-15 10:55:20 -04:00
struct rpc_rqst * rqst = task - > tk_rqstp ;
struct rpcrdma_xprt * r_xprt = rpcx_to_rdmax ( rqst - > rq_xprt ) ;
2018-05-04 15:35:09 -04:00
struct rpcrdma_req * req = rpcr_to_rdmar ( rqst ) ;
2015-01-26 17:11:47 -05:00
gfp_t flags ;
2007-09-10 13:50:12 -04:00
2016-01-07 14:50:10 -05:00
flags = RPCRDMA_DEF_GFP ;
2015-01-26 17:11:47 -05:00
if ( RPC_IS_SWAPPER ( task ) )
flags = __GFP_MEMALLOC | GFP_NOWAIT | __GFP_NOWARN ;
2019-04-24 09:39:27 -04:00
if ( ! rpcrdma_check_regbuf ( r_xprt , req - > rl_sendbuf , rqst - > rq_callsize ,
flags ) )
2016-09-15 10:55:53 -04:00
goto out_fail ;
2019-04-24 09:39:27 -04:00
if ( ! rpcrdma_check_regbuf ( r_xprt , req - > rl_recvbuf , rqst - > rq_rcvsize ,
flags ) )
2016-09-15 10:55:53 -04:00
goto out_fail ;
2019-04-24 09:39:16 -04:00
rqst - > rq_buffer = rdmab_data ( req - > rl_sendbuf ) ;
rqst - > rq_rbuffer = rdmab_data ( req - > rl_recvbuf ) ;
2018-12-19 11:00:00 -05:00
trace_xprtrdma_op_allocate ( task , req ) ;
2016-09-15 10:55:20 -04:00
return 0 ;
2015-01-21 11:04:08 -05:00
out_fail :
2018-12-19 11:00:00 -05:00
trace_xprtrdma_op_allocate ( task , NULL ) ;
2016-09-15 10:55:20 -04:00
return - ENOMEM ;
2007-09-10 13:50:12 -04:00
}
2016-09-15 10:55:29 -04:00
/**
* xprt_rdma_free - release resources allocated by xprt_rdma_allocate
* @ task : RPC task
*
* Caller guarantees rqst - > rq_buffer is non - NULL .
2007-09-10 13:50:12 -04:00
*/
static void
2016-09-15 10:55:29 -04:00
xprt_rdma_free ( struct rpc_task * task )
2007-09-10 13:50:12 -04:00
{
2016-09-15 10:55:29 -04:00
struct rpc_rqst * rqst = task - > tk_rqstp ;
struct rpcrdma_xprt * r_xprt = rpcx_to_rdmax ( rqst - > rq_xprt ) ;
struct rpcrdma_req * req = rpcr_to_rdmar ( rqst ) ;
2007-09-10 13:50:12 -04:00
2018-12-19 11:00:00 -05:00
trace_xprtrdma_op_free ( task , req ) ;
2019-06-19 10:33:15 -04:00
if ( ! list_empty ( & req - > rl_registered ) )
frwr_unmap_sync ( r_xprt , req ) ;
/* XXX: If the RPC is completing because of a signal and
* not because a reply was received , we ought to ensure
* that the Send completion has fired , so that memory
* involved with the Send is not still visible to the NIC .
*/
2007-09-10 13:50:12 -04:00
}
2016-06-29 13:53:43 -04:00
/**
* xprt_rdma_send_request - marshal and send an RPC request
2018-09-03 23:58:59 -04:00
* @ rqst : RPC message in rq_snd_buf
2016-06-29 13:53:43 -04:00
*
2017-04-11 13:23:10 -04:00
* Caller holds the transport ' s write lock .
*
2017-12-14 20:57:31 -05:00
* Returns :
* % 0 if the RPC message has been sent
* % - ENOTCONN if the caller should reconnect and call again
2018-02-28 15:30:44 -05:00
* % - EAGAIN if the caller should call again
* % - ENOBUFS if the caller should call again after a delay
2018-12-19 10:58:45 -05:00
* % - EMSGSIZE if encoding ran out of buffer space . The request
* was not sent . Do not try to send this message again .
* % - EIO if an I / O error occurred . The request was not sent .
* Do not try to send this message again .
2007-09-10 13:50:12 -04:00
*/
static int
2018-09-03 23:58:59 -04:00
xprt_rdma_send_request ( struct rpc_rqst * rqst )
2007-09-10 13:50:12 -04:00
{
2013-01-08 09:10:21 -05:00
struct rpc_xprt * xprt = rqst - > rq_xprt ;
2007-09-10 13:50:12 -04:00
struct rpcrdma_req * req = rpcr_to_rdmar ( rqst ) ;
struct rpcrdma_xprt * r_xprt = rpcx_to_rdmax ( xprt ) ;
2014-07-29 17:23:43 -04:00
int rc = 0 ;
2007-09-10 13:50:12 -04:00
2017-12-14 20:57:31 -05:00
# if defined(CONFIG_SUNRPC_BACKCHANNEL)
if ( unlikely ( ! rqst - > rq_buffer ) )
return xprt_rdma_bc_send_reply ( rqst ) ;
# endif /* CONFIG_SUNRPC_BACKCHANNEL */
2017-04-11 13:23:10 -04:00
if ( ! xprt_connected ( xprt ) )
2018-12-19 10:58:40 -05:00
return - ENOTCONN ;
2017-04-11 13:23:10 -04:00
2018-09-03 17:37:36 -04:00
if ( ! xprt_request_get_cong ( xprt , rqst ) )
return - EBADSLT ;
2017-08-10 12:47:12 -04:00
rc = rpcrdma_marshal_req ( r_xprt , rqst ) ;
2014-07-29 17:23:43 -04:00
if ( rc < 0 )
goto failed_marshal ;
2007-09-10 13:50:12 -04:00
2008-10-09 15:00:40 -04:00
/* Must suppress retransmit to maintain credits */
2018-02-28 15:30:38 -05:00
if ( rqst - > rq_connect_cookie = = xprt - > connect_cookie )
2008-10-09 15:00:40 -04:00
goto drop_connection ;
2018-03-05 15:13:07 -05:00
rqst - > rq_xtime = ktime_get ( ) ;
2008-10-09 15:00:40 -04:00
if ( rpcrdma_ep_post ( & r_xprt - > rx_ia , & r_xprt - > rx_ep , req ) )
goto drop_connection ;
2007-09-10 13:50:12 -04:00
2010-05-13 12:51:49 -04:00
rqst - > rq_xmit_bytes_sent + = rqst - > rq_snd_buf . len ;
2018-02-28 15:30:54 -05:00
/* An RPC with no reply will throw off credit accounting,
* so drop the connection to reset the credit grant .
*/
2018-08-30 13:27:29 -04:00
if ( ! rpc_reply_expected ( rqst - > rq_task ) )
2018-02-28 15:30:54 -05:00
goto drop_connection ;
2007-09-10 13:50:12 -04:00
return 0 ;
2008-10-09 15:00:40 -04:00
2014-05-28 10:35:14 -04:00
failed_marshal :
2016-06-29 13:53:43 -04:00
if ( rc ! = - ENOTCONN )
return rc ;
2008-10-09 15:00:40 -04:00
drop_connection :
2018-12-19 10:58:40 -05:00
xprt_rdma_close ( xprt ) ;
return - ENOTCONN ;
2007-09-10 13:50:12 -04:00
}
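The "Must suppress retransmit" check in the function above can be modeled in user space roughly as follows. This is a sketch under stated assumptions, not the transport's implementation: the struct and function names are hypothetical, the connection cookie is assumed to start at 1 and to increase on each reconnect, and the request's cookie is assumed to start at 0 and to be recorded when the request is sent.

```c
/* Hypothetical stand-ins for the transport and the RPC request. */
struct model_conn {
	unsigned int connect_cookie;	/* assumed to change on every reconnect */
};

struct model_request {
	unsigned int connect_cookie;	/* cookie of the connection it was last sent on */
};

enum send_result {
	SEND_OK,
	SEND_DROP_CONNECTION,
};

/* Refuse to send the same request twice on one connection instance.
 * A second attempt means a retransmit, which would consume a credit
 * the server has not granted, so the model asks for a disconnect.
 */
static enum send_result model_send(struct model_conn *conn,
				   struct model_request *rq)
{
	if (rq->connect_cookie == conn->connect_cookie)
		return SEND_DROP_CONNECTION;
	rq->connect_cookie = conn->connect_cookie;
	return SEND_OK;
}
```

In this model the first send on a connection succeeds, a retransmit on the same connection requests a disconnect, and the request can be sent again once the cookie has changed.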
2016-01-07 14:50:10 -05:00
void xprt_rdma_print_stats ( struct rpc_xprt * xprt , struct seq_file * seq )
2007-09-10 13:50:12 -04:00
{
struct rpcrdma_xprt * r_xprt = rpcx_to_rdmax ( xprt ) ;
long idle_time = 0 ;
if ( xprt_connected ( xprt ) )
idle_time = ( long ) ( jiffies - xprt - > last_used ) / HZ ;
2015-08-03 13:04:36 -04:00
seq_puts ( seq , " \t xprt: \t rdma " ) ;
seq_printf ( seq , " %u %lu %lu %lu %ld %lu %lu %lu %llu %llu " ,
0 , /* need a local port? */
xprt - > stat . bind_count ,
xprt - > stat . connect_count ,
2018-10-01 14:25:41 -04:00
xprt - > stat . connect_time / HZ ,
2015-08-03 13:04:36 -04:00
idle_time ,
xprt - > stat . sends ,
xprt - > stat . recvs ,
xprt - > stat . bad_xids ,
xprt - > stat . req_u ,
xprt - > stat . bklog_u ) ;
2016-06-29 13:52:54 -04:00
seq_printf ( seq , " %lu %lu %lu %llu %llu %llu %llu %lu %lu %lu %lu " ,
2015-08-03 13:04:36 -04:00
r_xprt - > rx_stats . read_chunk_count ,
r_xprt - > rx_stats . write_chunk_count ,
r_xprt - > rx_stats . reply_chunk_count ,
r_xprt - > rx_stats . total_rdma_request ,
r_xprt - > rx_stats . total_rdma_reply ,
r_xprt - > rx_stats . pullup_copy_count ,
r_xprt - > rx_stats . fixup_copy_count ,
r_xprt - > rx_stats . hardway_register_count ,
r_xprt - > rx_stats . failed_marshal_count ,
2015-08-03 13:04:45 -04:00
r_xprt - > rx_stats . bad_reply_count ,
r_xprt - > rx_stats . nomsg_call_count ) ;
2017-10-20 10:48:36 -04:00
seq_printf ( seq , " %lu %lu %lu %lu %lu %lu \n " ,
2018-10-01 14:25:25 -04:00
r_xprt - > rx_stats . mrs_recycled ,
2016-06-29 13:54:00 -04:00
r_xprt - > rx_stats . mrs_orphaned ,
2016-09-15 10:57:16 -04:00
r_xprt - > rx_stats . mrs_allocated ,
xprtrdma: Add data structure to manage RDMA Send arguments
Problem statement:
Recently Sagi Grimberg <sagi@grimberg.me> observed that kernel RDMA-
enabled storage initiators don't handle delayed Send completion
correctly. If Send completion is delayed beyond the end of a ULP
transaction, the ULP may release resources that are still being used
by the HCA to complete a long-running Send operation.
This is a common design trait amongst our initiators. Most Send
operations are faster than the ULP transaction they are part of.
Waiting for a completion for these is typically unnecessary.
Infrequently, a network partition or some other problem crops up
where an ordering problem can occur. In NFS parlance, the RPC Reply
arrives and completes the RPC, but the HCA is still retrying the
Send WR that conveyed the RPC Call. In this case, the HCA can try
to use memory that has been invalidated or DMA unmapped, and the
connection is lost. If that memory has been re-used for something
else (possibly not related to NFS), the Send retransmission
exposes that data on the wire.
Thus we cannot assume that it is safe to release Send-related
resources just because a ULP reply has arrived.
After some analysis, we have determined that the completion
housekeeping will not be difficult for xprtrdma:
- Inline Send buffers are registered via the local DMA key, and
are already left DMA mapped for the lifetime of a transport
connection, thus no additional handling is necessary for those
- Gathered Sends involving page cache pages _will_ need to
DMA unmap those pages after the Send completes. But like
inline send buffers, they are registered via the local DMA key,
and thus will not need to be invalidated
In addition, RPC completion will need to wait for Send completion
in the latter case. However, nearly always, the Send that conveys
the RPC Call will have completed long before the RPC Reply
arrives, and thus no additional latency will be accrued.
Design notes:
In this patch, the rpcrdma_sendctx object is introduced, and a
lock-free circular queue is added to manage a set of them per
transport.
The RPC client's send path already prevents sending more than one
RPC Call at the same time. This allows us to treat the consumer
side of the queue (rpcrdma_sendctx_get_locked) as if there is a
single consumer thread.
The producer side of the queue (rpcrdma_sendctx_put_locked) is
invoked only from the Send completion handler, which is a single
thread of execution (soft IRQ).
The only care that needs to be taken is with the tail index, which
is shared between the producer and consumer. Only the producer
updates the tail index. The consumer compares the head with the
tail to ensure that a sendctx that is in use is never handed
out again (or, expressed more conventionally, to detect that the queue is empty).
When the sendctx queue empties completely, there are enough Sends
outstanding that posting more Send operations can result in a Send
Queue overflow. In this case, the ULP is told to wait and try again.
This introduces strong Send Queue accounting to xprtrdma.
As a final touch, Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
suggested a mechanism that does not require signaling every Send.
We signal once every N Sends, and perform SGE unmapping of N Send
operations during that one completion.
Reported-by: Sagi Grimberg <sagi@grimberg.me>
Suggested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-10-20 10:48:12 -04:00
r_xprt - > rx_stats . local_inv_needed ,
2017-10-20 10:48:36 -04:00
r_xprt - > rx_stats . empty_sendctx_q ,
r_xprt - > rx_stats . reply_waits_for_send ) ;
2007-09-10 13:50:12 -04:00
}
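The sendctx queue described in the commit message above (one consumer on the send path, one producer in the Send completion handler, a tail index written only by the producer) can be sketched in user space with C11 atomics. This is a simplified model, not the kernel's code: the type and function names, the ring size, and the `id` field are all hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 4	/* hypothetical; one slot always stays unused */

struct model_sendctx {
	int id;
};

struct model_ring {
	struct model_sendctx ctxs[RING_SIZE];
	unsigned int head;		/* written only by the consumer */
	_Atomic unsigned int tail;	/* written only by the producer */
};

static unsigned int ring_next(unsigned int i)
{
	return (i + 1) % RING_SIZE;
}

static void ring_init(struct model_ring *r)
{
	for (int i = 0; i < RING_SIZE; i++)
		r->ctxs[i].id = i;
	r->head = 0;
	atomic_init(&r->tail, 0);
}

/* Consumer side (send path): hand out the next free context, or
 * NULL when advancing the head would reach the tail, meaning every
 * usable context is still in flight.
 */
static struct model_sendctx *ring_get(struct model_ring *r)
{
	unsigned int next = ring_next(r->head);

	if (next == atomic_load_explicit(&r->tail, memory_order_acquire))
		return NULL;
	r->head = next;
	return &r->ctxs[next];
}

/* Producer side (Send completion): release every context up to and
 * including sc. This models signaling only every Nth Send: one
 * completion retires all the Sends posted before it.
 */
static void ring_put(struct model_ring *r, struct model_sendctx *sc)
{
	unsigned int tail = atomic_load_explicit(&r->tail,
						 memory_order_relaxed);

	do {
		tail = ring_next(tail);
		/* a real transport would unmap r->ctxs[tail] here */
	} while (&r->ctxs[tail] != sc);
	atomic_store_explicit(&r->tail, tail, memory_order_release);
}
```

With four slots, three contexts can be outstanding at once; a fourth `ring_get` returns NULL until a completion calls `ring_put`, which corresponds to telling the ULP to wait and try again.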
2015-06-03 16:14:29 -04:00
static int
xprt_rdma_enable_swap ( struct rpc_xprt * xprt )
{
2015-10-24 17:26:29 -04:00
return 0 ;
2015-06-03 16:14:29 -04:00
}
static void
xprt_rdma_disable_swap ( struct rpc_xprt * xprt )
{
}
2007-09-10 13:50:12 -04:00
/*
* Plumbing for rpc transport switch and kernel module
*/
2017-08-01 12:00:39 -04:00
static const struct rpc_xprt_ops xprt_rdma_procs = {
2014-05-28 10:34:57 -04:00
. reserve_xprt = xprt_reserve_xprt_cong ,
2007-09-10 13:50:12 -04:00
. release_xprt = xprt_release_xprt_cong , /* sunrpc/xprt.c */
2018-05-04 15:35:04 -04:00
. alloc_slot = xprt_rdma_alloc_slot ,
. free_slot = xprt_rdma_free_slot ,
2007-09-10 13:50:12 -04:00
. release_request = xprt_release_rqst_cong , /* ditto */
2019-04-07 13:58:46 -04:00
. wait_for_reply_request = xprt_wait_for_reply_request_def , /* ditto */
xprtrdma: Detect unreachable NFS/RDMA servers more reliably
Current NFS clients rely on connection loss to determine when to
retransmit. In particular, for protocols like NFSv4, clients no
longer rely on RPC timeouts to drive retransmission: NFSv4 servers
are required to terminate a connection when they need a client to
retransmit pending RPCs.
When a server is no longer reachable, either because it has crashed
or because the network path has broken, the server cannot actively
terminate a connection. Thus NFS clients depend on transport-level
keepalive to determine when a connection must be replaced and
pending RPCs retransmitted.
However, RDMA RC connections do not have a native keepalive
mechanism. If an NFS/RDMA server crashes after a client has sent
RPCs successfully (an RC ACK has been received for all OTW RDMA
requests), there is no way for the client to know the connection is
moribund.
In addition, new RDMA requests are subject to the RPC-over-RDMA
credit limit. If the client has consumed all granted credits with
NFS traffic, it is not allowed to send another RDMA request until
the server replies. Thus it has no way to send a true keepalive when
the workload has already consumed all credits with pending RPCs.
To address this, forcibly disconnect a transport when an RPC times
out. This prevents moribund connections from stopping the
detection of failover or other configuration changes on the server.
Note that even if the connection is still good, retransmitting
any RPC will trigger a disconnect thanks to this logic in
xprt_rdma_send_request:
/* Must suppress retransmit to maintain credits */
if (req->rl_connect_cookie == xprt->connect_cookie)
goto drop_connection;
req->rl_connect_cookie = xprt->connect_cookie;
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-11 13:22:46 -04:00
. timer = xprt_rdma_timer ,
2007-09-10 13:50:12 -04:00
. rpcbind = rpcb_getport_async , /* sunrpc/rpcb_clnt.c */
. set_port = xprt_rdma_set_port ,
. connect = xprt_rdma_connect ,
. buf_alloc = xprt_rdma_allocate ,
. buf_free = xprt_rdma_free ,
. send_request = xprt_rdma_send_request ,
. close = xprt_rdma_close ,
. destroy = xprt_rdma_destroy ,
2019-08-19 18:49:30 -04:00
. set_connect_timeout = xprt_rdma_set_connect_timeout ,
2015-06-03 16:14:29 -04:00
. print_stats = xprt_rdma_print_stats ,
. enable_swap = xprt_rdma_enable_swap ,
. disable_swap = xprt_rdma_disable_swap ,
2015-10-24 17:27:43 -04:00
. inject_disconnect = xprt_rdma_inject_disconnect ,
# if defined(CONFIG_SUNRPC_BACKCHANNEL)
. bc_setup = xprt_rdma_bc_setup ,
2016-05-02 14:40:40 -04:00
. bc_maxpayload = xprt_rdma_bc_maxpayload ,
2019-07-16 13:51:29 -04:00
. bc_num_slots = xprt_rdma_bc_max_slots ,
2015-10-24 17:27:43 -04:00
. bc_free_rqst = xprt_rdma_bc_free_rqst ,
. bc_destroy = xprt_rdma_bc_destroy ,
# endif
2007-09-10 13:50:12 -04:00
} ;
static struct xprt_class xprt_rdma = {
. list = LIST_HEAD_INIT ( xprt_rdma . list ) ,
. name = " rdma " ,
. owner = THIS_MODULE ,
. ident = XPRT_TRANSPORT_RDMA ,
. setup = xprt_setup_rdma ,
} ;
2015-06-04 11:21:42 -04:00
void xprt_rdma_cleanup ( void )
2007-09-10 13:50:12 -04:00
{
2014-11-17 16:58:04 -05:00
# if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
2007-09-10 13:50:12 -04:00
if ( sunrpc_table_header ) {
unregister_sysctl_table ( sunrpc_table_header ) ;
sunrpc_table_header = NULL ;
}
# endif
2016-01-07 14:50:10 -05:00
2018-12-19 10:59:39 -05:00
xprt_unregister_transport ( & xprt_rdma ) ;
xprt_unregister_transport ( & xprt_rdma_bc ) ;
2007-09-10 13:50:12 -04:00
}
2015-06-04 11:21:42 -04:00
int xprt_rdma_init ( void )
2007-09-10 13:50:12 -04:00
{
int rc ;
2015-05-26 11:52:25 -04:00
rc = xprt_register_transport ( & xprt_rdma ) ;
2018-12-19 10:58:29 -05:00
if ( rc )
2015-05-26 11:52:25 -04:00
return rc ;
2016-01-07 14:50:10 -05:00
rc = xprt_register_transport ( & xprt_rdma_bc ) ;
if ( rc ) {
xprt_unregister_transport ( & xprt_rdma ) ;
return rc ;
}
2014-11-17 16:58:04 -05:00
# if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
2007-09-10 13:50:12 -04:00
if ( ! sunrpc_table_header )
sunrpc_table_header = register_sysctl_table ( sunrpc_table ) ;
# endif
return 0 ;
}
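xprt_rdma_init registers two transport classes and unregisters the first if the second fails. A minimal user-space sketch of that register-with-rollback pattern follows; the registry, its size, and every function name here are hypothetical and only stand in for the SUNRPC transport list.

```c
#include <stddef.h>
#include <string.h>

#define MAX_TRANSPORTS 4	/* hypothetical table size */

static const char *registry[MAX_TRANSPORTS];

/* Return the slot holding name, or -1 if it is not registered. */
static int find_slot(const char *name)
{
	for (int i = 0; i < MAX_TRANSPORTS; i++)
		if (registry[i] && strcmp(registry[i], name) == 0)
			return i;
	return -1;
}

/* Return 0 on success, -1 if name is a duplicate or the table is full. */
static int register_transport(const char *name)
{
	if (find_slot(name) >= 0)
		return -1;
	for (int i = 0; i < MAX_TRANSPORTS; i++) {
		if (!registry[i]) {
			registry[i] = name;
			return 0;
		}
	}
	return -1;
}

static void unregister_transport(const char *name)
{
	int i = find_slot(name);

	if (i >= 0)
		registry[i] = NULL;
}

static int registered_count(void)
{
	int n = 0;

	for (int i = 0; i < MAX_TRANSPORTS; i++)
		if (registry[i])
			n++;
	return n;
}

/* Register both transports or neither: undo the first registration
 * when the second one fails.
 */
static int init_pair(const char *fore, const char *back)
{
	int rc = register_transport(fore);

	if (rc)
		return rc;
	rc = register_transport(back);
	if (rc) {
		unregister_transport(fore);
		return rc;
	}
	return 0;
}
```

The point of the rollback is that a failed init leaves no partial state behind, so the caller can simply retry or give up.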