IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Clean up: Move more of struct rpcrdma_frwr into its parent.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Clean up.
- Simplify variable initialization in the completion handlers.
- Move another field out of struct rpcrdma_frwr.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Clean up (for several purposes):
- The MR's cid is initialized sooner so that tracepoints can show
something reasonable even if the MR is never posted.
- The MR's res.id doesn't change so the cid won't change either.
Initializing the cid once is sufficient.
- struct rpcrdma_frwr is going away soon.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Clean up: The handler only recorded a trace event. If indeed no
action is needed by the RPC/RDMA consumer, then the event can be
ignored.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The Send signaling logic is a little subtle, so add some
observability around it. For every xprtrdma_mr_fastreg event, there
should be an xprtrdma_mr_localinv or xprtrdma_mr_reminv event.
When these tracepoints are enabled, we can see exactly when an MR is
DMA-mapped, registered, invalidated (either locally or remotely) and
then DMA-unmapped.
kworker/u25:2-190 [000] 787.979512: xprtrdma_mr_map: task:351@5 mr.id=4 nents=2 5608@0x8679e0c8f6f56000:0x00000503 (TO_DEVICE)
kworker/u25:2-190 [000] 787.979515: xprtrdma_chunk_read: task:351@5 pos=148 5608@0x8679e0c8f6f56000:0x00000503 (last)
kworker/u25:2-190 [000] 787.979519: xprtrdma_marshal: task:351@5 xid=0x8679e0c8: hdr=52 xdr=148/5608/0 read list/inline
kworker/u25:2-190 [000] 787.979525: xprtrdma_mr_fastreg: task:351@5 mr.id=4 nents=2 5608@0x8679e0c8f6f56000:0x00000503 (TO_DEVICE)
kworker/u25:2-190 [000] 787.979526: xprtrdma_post_send: task:351@5 cq.id=0 cid=73 (2 SGEs)
...
kworker/5:1H-219 [005] 787.980567: xprtrdma_wc_receive: cq.id=1 cid=161 status=SUCCESS (0/0x0) received=164
kworker/5:1H-219 [005] 787.980571: xprtrdma_post_recvs: peer=[192.168.100.55]:20049 r_xprt=0xffff8884974d4000: 0 new recvs, 70 active (rc 0)
kworker/5:1H-219 [005] 787.980573: xprtrdma_reply: task:351@5 xid=0x8679e0c8 credits=64
kworker/5:1H-219 [005] 787.980576: xprtrdma_mr_reminv: task:351@5 mr.id=4 nents=2 5608@0x8679e0c8f6f56000:0x00000503 (TO_DEVICE)
kworker/5:1H-219 [005] 787.980577: xprtrdma_mr_unmap: mr.id=4 nents=2 5608@0x8679e0c8f6f56000:0x00000503 (TO_DEVICE)
Note that I've moved the xprtrdma_post_send tracepoint so that event
always appears after the xprtrdma_mr_fastreg tracepoint. Otherwise
the event log looks counterintuitive (FastReg is always supposed to
happen before Send).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Send WRs can be signalled or unsignalled. A signalled Send WR
always has a matching Send completion, while a unsignalled Send
has a completion only if the Send WR fails.
xprtrdma has a Send account mechanism that is designed to reduce
the number of signalled Send WRs. This in turn mitigates the
interrupt rate of the underlying device.
RDMA consumers can't leave all Sends unsignaled, however, because
providers rely on Send completions to maintain their Send Queue head
and tail pointers. xprtrdma counts the number of unsignaled Send WRs
that have been posted to ensure that Sends are signalled often
enough to prevent the Send Queue from wrapping.
This mechanism neglected to account for FastReg WRs, which are
posted on the Send Queue but never signalled. As a result, the
Send Queue wrapped on occasion, resulting in duplication completions
of FastReg and LocalInv WRs.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Throw away any reply where the LocalInv flushes or could not be
posted. The registered memory region is in an unknown state until
the disconnect completes.
rpcrdma_xprt_disconnect() will find and release the MR. No need to
put it back on the MR free list in this case.
The client retransmits pending RPC requests once it reestablishes a
fresh connection, so a replacement reply should be forthcoming on
the next connection instance.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Better not to touch MRs involved in a flush or post error until the
Send and Receive Queues are drained and the transport is fully
quiescent. Simply don't insert such MRs back onto the free list.
They remain on mr_all and will be released when the connection is
torn down.
I had thought that recycling would prevent hardware resources from
being tied up for a long time. However, since v5.7, a transport
disconnect destroys the QP and other hardware-owned resources. The
MRs get cleaned up nicely at that point.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Clean up: The comment and the placement of the memory barrier is
confusing. Humans want to read the function statements from head
to tail.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Clean up: To be consistent with other functions in this source file,
follow the naming convention of putting the object being acted upon
before the action itself.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The rpcrdma_mr_pop() earlier in the function has already cleared
out mr_list, so it must not be done again in the error path.
Fixes: 847568942f93 ("xprtrdma: Remove fr_state")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Clean up: The name recv_buffer_put() is a vestige of older code,
and the function is just a wrapper for the newer rpcrdma_rep_put().
In most of the existing call sites, a pointer to the owning
rpcrdma_buffer is already available.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
After a reconnect, the reply handler is opening the cwnd (and thus
enabling more RPC Calls to be sent) /before/ rpcrdma_post_recvs()
can post enough Receive WRs to receive their replies. This causes an
RNR and the new connection is lost immediately.
The race is most clearly exposed when KASAN and disconnect injection
are enabled. This slows down rpcrdma_rep_create() enough to allow
the send side to post a bunch of RPC Calls before the Receive
completion handler can invoke ib_post_recv().
Fixes: 2ae50ad68cd7 ("xprtrdma: Close window between waking RPC senders and posting Receives")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Defensive clean up: Protect the rb_all_reps list during rep
creation.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Currently rpcrdma_reps_destroy() assumes that, at transport
tear-down, the content of the rb_free_reps list is the same as the
content of the rb_all_reps list. Although that is usually true,
using the rb_all_reps list should be more reliable because of
the way it's managed. And, rpcrdma_reps_unmap() uses rb_all_reps;
these two functions should both traverse the "all" list.
Ensure that all rpcrdma_reps are always destroyed whether they are
on the rep free list or not.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Defer destruction of an rpcrdma_rep until transport tear-down to
preserve the rb_all_reps list while Receives flush.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Currently the Receive completion handler refreshes the Receive Queue
whenever a successful Receive completion occurs.
On disconnect, xprtrdma drains the Receive Queue. The first few
Receive completions after a disconnect are typically successful,
until the first flushed Receive.
This means the Receive completion handler continues to post more
Receive WRs after the drain sentinel has been posted. The late-
posted Receives flush after the drain sentinel has completed,
leading to a crash later in rpcrdma_xprt_disconnect().
To prevent this crash, xprtrdma has to ensure that the Receive
handler stops posting Receives before ib_drain_rq() posts its
drain sentinel.
Suggested-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Commit e340c2d6ef2a ("xprtrdma: Reduce the doorbell rate (Receive)")
increased the number of Receive WRs that are posted by the client,
but did not increase the size of the Receive Queue allocated during
transport set-up.
This is usually not an issue because RPCRDMA_BACKWARD_WRS is defined
as (32) when SUNRPC_BACKCHANNEL is defined. In cases where it isn't,
there is a real risk of Receive Queue wrapping.
Fixes: e340c2d6ef2a ("xprtrdma: Reduce the doorbell rate (Receive)")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
I've hit some crashes that occur in the xprt_rdma_inject_disconnect
path. It appears that, for some provides, rdma_disconnect() can
take so long that the transport can disconnect and release its
hardware resources while rdma_disconnect() is still running,
resulting in a UAF in the provider.
The transport's fault injection method may depend on the stability
of transport data structures. That means it needs to be invoked
only from contexts that hold the transport write lock.
Fixes: 4a0682583988 ("SUNRPC: Transport fault injection")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
I tested commit 43042b90cae1 ("svcrdma: Reduce Receive doorbell
rate") with mlx4 (IB) and software iWARP and didn't find any
issues. However, I recently got my hardware iWARP setup back on
line (FastLinQ) and it's crashing hard on this commit (confirmed
via bisect).
The failure mode is complex.
- After a connection is established, the first Receive completes
normally.
- But the second and third Receives have garbage in their Receive
buffers. The server responds with ERR_VERS as a result.
- When the client tears down the connection to retry, a couple
of posted Receives flush twice, and that corrupts the recv_ctxt
free list.
- __svc_rdma_free then faults or loops infinitely while destroying
the xprt's recv_ctxts.
Since 43042b90cae1 ("svcrdma: Reduce Receive doorbell rate") does
not fix a bug but is a scalability enhancement, it's safe and
appropriate to revert it while working on a replacement.
Fixes: 43042b90cae1 ("svcrdma: Reduce Receive doorbell rate")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This brings it in line with the regular tcp backchannel, which also has
all those timeouts disabled.
Prevents the backchannel from timing out, getting some async operations
like server side copying getting stuck indefinitely on the client side.
Signed-off-by: Timo Rothenpieler <timo@rothenpieler.org>
Fixes: 5d252f90a800 ("svcrdma: Add class for RDMA backwards direction transport")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
- New Features:
- Support for eager writes, and the write=eager and write=wait mount options
- Other Bugfixes and Cleanups:
- Fix typos in some comments
- Fix up fall-through warnings for Clang
- Cleanups to the NFS readpage codepath
- Remove FMR support in rpcrdma_convert_iovs()
- Various other cleanups to xprtrdma
- Fix xprtrdma pad optimization for servers that don't support RFC 8797
- Improvements to rpcrdma tracepoints
- Fix up nfs4_bitmask_adjust()
- Optimize sparse writes past the end of files
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmAwOLwACgkQ18tUv7Cl
QOsUfw//W2KoJ+2IQohQNFcoi+bG1OQE7jnqHtQ+tsKfpJKemcDcu8wQEAqrwALg
vXioG1Ye0QU7P5PZtNxCorylqSTVGvJSIOrfa3lTdn/PDbI7NIgN52w56TzzfeXn
pJ4gDwZzPwUFUblF0LBQUIhJv5IQvOXVgUsMqezbIbMXSiuLR/bjnZ96Q/woKpoL
eg2IZ5EO9Jb0QjuQ1e9U303X7c2qOl1jzpxyQLQfD7ONnWBx3HnJk1l+3JJRi8JV
smnae3I0L3nUZ7rBqoqsvK7YUjUchCEBvkmEMsnHT94D5tI9mxxX5OquREee6QHn
NuJRSNbsIiCD3Ne27fkCut78d6SetoMko7jZ97T6smhyijtXJiLG/6dycMPV9rt/
bVIudWMm9/A9AsXyY2YP5LC6Y6W6dhQRXygUjVgEPBl6kVsb2Eca8IA9QZghF9IL
+XSEulASvxo2rWPylJJ+3aLynfqoHrowVN/Tu61svDnJWTcb+FCxQ5zyLox7erEH
mUhraf1D0uoX9odH1069toN6favZFE6SIDvlUk1QTOjr6p3Jxmkuyl6PNs5t66/S
550z5JVb2deIHOPQxOie7xz/Dk6dnRoaFhTNq/Ootkt9GNe0A+NqSUdoRA5XxN5m
wW11ecLSZSehDksuXjyFmkHtkagLreFxLsHbVnaAtwEm7h/thRI=
=Dssn
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.12-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS Client Updates from Anna Schumaker:
"New Features:
- Support for eager writes, and the write=eager and write=wait mount
options
- Other Bugfixes and Cleanups:
- Fix typos in some comments
- Fix up fall-through warnings for Clang
- Cleanups to the NFS readpage codepath
- Remove FMR support in rpcrdma_convert_iovs()
- Various other cleanups to xprtrdma
- Fix xprtrdma pad optimization for servers that don't support
RFC 8797
- Improvements to rpcrdma tracepoints
- Fix up nfs4_bitmask_adjust()
- Optimize sparse writes past the end of files"
* tag 'nfs-for-5.12-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (27 commits)
NFS: Support the '-owrite=' option in /proc/self/mounts and mountinfo
NFS: Set the stable writes flag when initialising the super block
NFS: Add mount options supporting eager writes
NFS: Add support for eager writes
NFS: 'flags' field should be unsigned in struct nfs_server
NFS: Don't set NFS_INO_INVALID_XATTR if there is no xattr cache
NFS: Always clear an invalid mapping when attempting a buffered write
NFS: Optimise sparse writes past the end of file
NFS: Fix documenting comment for nfs_revalidate_file_size()
NFSv4: Fixes for nfs4_bitmask_adjust()
xprtrdma: Clean up rpcrdma_prepare_readch()
rpcrdma: Capture bytes received in Receive completion tracepoints
xprtrdma: Pad optimization, revisited
rpcrdma: Fix comments about reverse-direction operation
xprtrdma: Refactor invocations of offset_in_page()
xprtrdma: Simplify rpcrdma_convert_kvec() and frwr_map()
xprtrdma: Remove FMR support in rpcrdma_convert_iovs()
NFS: Add nfs_pageio_complete_read() and remove nfs_readpage_async()
NFS: Call readpage_async_filler() from nfs_readpage_async()
NFS: Refactor nfs_readpage() and nfs_readpage_async() to use nfs_readdesc
...
- Cork the socket while there are queued replies
Fixes:
- DRC shutdown ordering
- svc_rdma_accept() lockdep splat
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmAsA80ACgkQM2qzM29m
f5erXA/+MrR3ZtwK2eaTITu13TzzTrMURbp/n0wCCW/Ls1YMb6bn9ggtBwu2W5Cn
Vb0RO9OLcmoI6CjqPh0CTUvvZspMYOAX4W1jQecKt2ml075APdlqUcv9YWPUQqVJ
qTg8HxDymvHvY3I3FcBxhzofmGzF8AOmQZJw9uI5Wt/ivBfqGWcAGlxyRmB3mdsm
cJRK0Sy7QMn2LefMcpMEeSbPA049/NZNRp6fcXnpPQFer42thoosYsNhTlAJfCXC
C5S0z3/T6rpuJucV9la/WkpUA0YhWbPEHWNdAB5tzSqmoEo4LpzJzjv7uyQU4oue
QlmChIz9qasgTI/BnCkBIzPD99S4UQcXjX0BnNinkQ77e6+b/vdAR+T+NLHJdkAf
+7Xz6T9aZNaz2R49CjYl6/kG0rlNkjUzyURRYs/9zEBhogMPH/N4T7Z2M+ljCkeb
tc3OaFDXZ2rfr7EKBGsfnEKINM1gpYipzILkr8GSHUMZLzOB/64upKySaJVjCGXj
7Sf1w+vJUWwYc+FqFvbaR4ybr01VIfdsecpn1TtY870zG1JzimzAHVZk1/xC9+CX
J+lVOXbjawDl1Et3V3fWq6Y7mhAWves/NKPcbSug9sFc4qRHEmPbAq/RRtlsjQcn
foMr5R8qd8OwEamVypZ2nIFxq4q3b742AS8lZhaK+DyZKq3oLac=
=+R4U
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull more nfsd updates from Chuck Lever:
"Here are a few additional NFSD commits for the merge window:
Optimization:
- Cork the socket while there are queued replies
Fixes:
- DRC shutdown ordering
- svc_rdma_accept() lockdep splat"
* tag 'nfsd-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
SUNRPC: Further clean up svc_tcp_sendmsg()
SUNRPC: Remove redundant socket flags from svc_tcp_sendmsg()
SUNRPC: Use TCP_CORK to optimise send performance on the server
svcrdma: Hold private mutex while invoking rdma_accept()
nfsd: register pernet ops last, unregister first
RDMA core mutex locking was restructured by commit d114c6feedfe
("RDMA/cma: Add missing locking to rdma_accept()") [Aug 2020]. When
lock debugging is enabled, the RPC/RDMA server trips over the new
lockdep assertion in rdma_accept() because it doesn't call
rdma_accept() from its CM event handler.
As a temporary fix, have svc_rdma_accept() take the handler_mutex
explicitly. In the meantime, let's consider how to restructure the
RPC/RDMA transport to invoke rdma_accept() from the proper context.
Calls to svc_rdma_accept() are serialized with calls to
svc_rdma_free() by the generic RPC server layer.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/linux-rdma/20210209154014.GO4247@nvidia.com/
Fixes: d114c6feedfe ("RDMA/cma: Add missing locking to rdma_accept()")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Since commit 9ed5af268e88 ("SUNRPC: Clean up the handling of page
padding in rpc_prepare_reply_pages()") [Dec 2020] the NFS client
passes payload data to the transport with the padding in xdr->pages
instead of in the send buffer's tail kvec. There's no need for the
extra logic to advance the base of the tail kvec because the upper
layer no longer places XDR padding there.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The NetApp Linux team discovered that with NFS/RDMA servers that do
not support RFC 8797, the Linux client is forming NFSv4.x WRITE
requests incorrectly.
In this case, the Linux NFS client disables implicit chunk round-up
for odd-length Read and Write chunks. The goal was to support old
servers that needed that padding to be sent explicitly by clients.
In that case the Linux NFS included the tail kvec in the Read chunk,
since the tail contains any needed padding. That meant a separate
memory registration is needed for the tail kvec, adding to the cost
of forming such requests. To avoid that cost for a mere 3 bytes of
zeroes that are always ignored by receivers, we try to use implicit
roundup when possible.
For NFSv4.x, the tail kvec also sometimes contains a trailing
GETATTR operation. The Linux NFS client unintentionally includes
that GETATTR operation in the Read chunk as well as inline.
The fix is simply to /never/ include the tail kvec when forming a
data payload Read chunk. The padding is thus now always present.
Note that since commit 9ed5af268e88 ("SUNRPC: Clean up the handling
of page padding in rpc_prepare_reply_pages()") [Dec 2020] the NFS
client passes payload data to the transport with the padding in
xdr->pages instead of in the send buffer's tail kvec. So now the
Linux NFS client appends XDR padding to all odd-sized Read chunks.
This shouldn't be a problem because:
- RFC 8166-compliant servers are supposed to work with or without
that XDR padding in Read chunks.
- Since the padding is now in the same memory region as the data
payload, a separate memory registration is not needed. In
addition, the link layer extends data in RDMA Read responses to
4-byte boundaries anyway. Thus there is now no savings when the
padding is not included.
Because older kernels include the payload's XDR padding in the
tail kvec, a fix there will be more complicated. Thus backporting
this patch is not recommended.
Reported by: Olga Kornievskaia <Olga.Kornievskaia@netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
During the final stages of publication of RFC 8167, reviewers
requested that we use the term "reverse direction" rather than
"backwards direction". Update comments to reflect this preference.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up so that offset_in_page() is invoked less often in the
most common case, which is mapping xdr->pages.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up.
Remove a conditional branch from the SGL set-up loop in frwr_map():
Instead of using either sg_set_page() or sg_set_buf(), initialize
the mr_page field properly when rpcrdma_convert_kvec() converts the
kvec to an SGL entry. frwr_map() can then invoke sg_set_page()
unconditionally.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Support for FMR was removed by commit ba69cd122ece ("xprtrdma:
Remove support for FMR memory registration") [Dec 2018]. That means
the buffer-splitting behavior of rpcrdma_convert_kvec(), added by
commit 821c791a0bde ("xprtrdma: Segment head and tail XDR buffers
on page boundaries") [Mar 2016], is no longer necessary. FRWR
memory registration handles this case with aplomb.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The Receive completion handler doesn't look at the contents of the
Receive buffer. The DMA sync isn't terribly expensive but it's one
less thing that needs to be done by the Receive completion handler,
which is single-threaded (per svc_xprt). This helps scalability.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
This is similar to commit e340c2d6ef2a ("xprtrdma: Reduce the
doorbell rate (Receive)") which added Receive batching to the
client.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Clean up. We are not permitted to remove old proc files. Instead,
convert these variables to stubs that are only ever allowed to
display a value of zero.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Now that we have an efficient mechanism to update these two stats,
let's start maintaining them again.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Receives are frequent events. Avoid the overhead of a memory bus
lock cycle for counting a value that is hardly every used.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Highlights include:
Features:
- NFSv3: Add emulation of lookupp() to improve open_by_filehandle()
support.
- A series of patches to improve readdir performance, particularly with
large directories.
- Basic support for using NFS/RDMA with the pNFS files and flexfiles
drivers.
- Micro-optimisations for RDMA.
- RDMA tracing improvements.
Bugfixes:
- Fix a long standing bug with xs_read_xdr_buf() when receiving partial
pages (Dan Aloni).
- Various fixes for getxattr and listxattr, when used over non-TCP
transports.
- Fixes for containerised NFS from Sargun Dhillon.
- switch nfsiod to be an UNBOUND workqueue (Neil Brown).
- READDIR should not ask for security label information if there is no
LSM policy. (Olga Kornievskaia)
- Avoid using interval-based rebinding with TCP in lockd (Calum Mackay).
- A series of RPC and NFS layer fixes to support the NFSv4.2 READ_PLUS code.
- A couple of fixes for pnfs/flexfiles read failover
Cleanups:
- Various cleanups for the SUNRPC xdr code in conjunction with the
READ_PLUS fixes.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl/aiaIACgkQZwvnipYK
APIOihAAvONscxrFSaGRh2ICNv9I/zXW/A5+R3qnkESPVLTqTPJVphoN7FlINAr1
B74pg6n4T4viycbvsogU2+kHrlJZO7B8lTkJL7ynm9Wgyw8+2Ga4QEn1bsAoqmuY
b91p/+LfOLKrYeeojoH31PC73uOYYG1WHXJhjq0l9b5CTgThWpj6O3gDaFEbFvmz
A7V3yqSp04sV70YxUhwelBHZ5BXdiXIKsPnIwvXXHuY7IcamrE4EA3wGCwtxkBnu
4dwbOtRXURNSev0r3n6FsH4wZl+/nvp9UpnGdPtVv94F1zm2JKLwkhoJejS/vpjq
eyKc7ZXBQ0uHbTWI2Yj1YjA61VIUO0R0EDuyTAnRKDeaarID42n5kMG7J8cIglZR
jQfyx99xm0eSrdwxC09tcRL/lBzYcOfc6pJo5P9BtaFtRvbp9iFIHuFKlrXbULd4
WrZzDMhiKVYGSTcTpfQyVoK2rCvn6W1Ida4iYeI0gkJ1v9X90UhbtJOyggn/bxyL
DV/Qy40+l48n7CZfPU2eDv4WXqjKGRibpDoWMBLwUH20dDEX6kKYv3BfApFYGqyO
/GTPAFUZarCy8BENvzZv/Jb9mt5pDQM5p9ZXpdUOhydLMMA+pauaT/Gr+pAHPIPx
MPj546Gh2cEaT883xvRrJmQTG0nw/WscPNcHaJcgL5oYltmuwck=
=IKWG
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.11-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Features:
- NFSv3: Add emulation of lookupp() to improve open_by_filehandle()
support
- A series of patches to improve readdir performance, particularly
with large directories
- Basic support for using NFS/RDMA with the pNFS files and flexfiles
drivers
- Micro-optimisations for RDMA
- RDMA tracing improvements
Bugfixes:
- Fix a long standing bug with xs_read_xdr_buf() when receiving
partial pages (Dan Aloni)
- Various fixes for getxattr and listxattr, when used over non-TCP
transports
- Fixes for containerised NFS from Sargun Dhillon
- switch nfsiod to be an UNBOUND workqueue (Neil Brown)
- READDIR should not ask for security label information if there is
no LSM policy (Olga Kornievskaia)
- Avoid using interval-based rebinding with TCP in lockd (Calum
Mackay)
- A series of RPC and NFS layer fixes to support the NFSv4.2
READ_PLUS code
- A couple of fixes for pnfs/flexfiles read failover
Cleanups:
- Various cleanups for the SUNRPC xdr code in conjunction with the
READ_PLUS fixes"
* tag 'nfs-for-5.11-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (90 commits)
NFS/pNFS: Fix a typo in ff_layout_resend_pnfs_read()
pNFS/flexfiles: Avoid spurious layout returns in ff_layout_choose_ds_for_read
NFSv4/pnfs: Add tracing for the deviceid cache
fs/lockd: convert comma to semicolon
NFSv4.2: fix error return on memory allocation failure
NFSv4.2/pnfs: Don't use READ_PLUS with pNFS yet
NFSv4.2: Deal with potential READ_PLUS data extent buffer overflow
NFSv4.2: Don't error when exiting early on a READ_PLUS buffer overflow
NFSv4.2: Handle hole lengths that exceed the READ_PLUS read buffer
NFSv4.2: decode_read_plus_hole() needs to check the extent offset
NFSv4.2: decode_read_plus_data() must skip padding after data segment
NFSv4.2: Ensure we always reset the result->count in decode_read_plus()
SUNRPC: When expanding the buffer, we may need grow the sparse pages
SUNRPC: Cleanup - constify a number of xdr_buf helpers
SUNRPC: Clean up open coded setting of the xdr_stream 'nwords' field
SUNRPC: _copy_to/from_pages() now check for zero length
SUNRPC: Cleanup xdr_shrink_bufhead()
SUNRPC: Fix xdr_expand_hole()
SUNRPC: Fixes for xdr_align_data()
SUNRPC: _shift_data_left/right_pages should check the shift length
...
Olga K. observed that rpcrdma_marsh_req() allocates sparse pages
only when it has determined that a Reply chunk is necessary. There
are plenty of cases where no Reply chunk is needed, but the
XDRBUF_SPARSE_PAGES flag is set. The result would be a crash in
rpcrdma_inline_fixup() when it tries to copy parts of the received
Reply into a missing page.
To avoid crashing, handle sparse page allocation up front.
Until XATTR support was added, this issue did not appear often
because the only SPARSE_PAGES consumer always expected a reply large
enough to always require a Reply chunk.
Reported-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
According to RFC5666, the correct netid for an IPv6 addressed RDMA
transport is "rdma6", which we've supported as a mount option since
Linux-4.7. The problem is when we try to load the module "xprtrdma6",
that will fail, since there is no modulealias of that name.
Fixes: 181342c5ebe8 ("xprtrdma: Add rdma6 option to support NFS/RDMA IPv6")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
An efficient way to handle multiple Read chunks is to post them all
together and then take a single completion. This is also how the
code is already structured: when the Read completion fires, all
portions of the incoming RPC message are available to be assembled.
The difficult problem is setting up the Read sink buffers so that
the server pulls the client's data into place, making subsequent
pull-up unnecessary. There are several cases:
* No Read chunks. No-op.
* One data item Read chunk. This is the fast case, where the inline
part of the RPC-over-RDMA message becomes the head and tail, and
the data item chunk is placed in buf->pages.
* A Position-zero Read chunk. Treated like TCP: the Read chunk is
pulled into contiguous pages.
+ A Position-zero Read chunk with data item chunks. Treated like
TCP: all of the Read chunks are pulled into contiguous pages.
+ Multiple data item chunks. Treated like TCP: the inline part is
copied and the data item chunks are pulled into contiguous pages.
The "*" cases are already supported. This patch adds support for the
"+" cases.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
As a pre-requisite for handling multiple Read chunks in each Read
list, convert svc_rdma_recv_read_chunk() to use the new parsed Read
chunk list.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
I'm about to change the purpose of ri_chunklen: Instead of tracking
the number of bytes in one Read chunk, it will track the total
number of bytes in the Read list. Rename it for clarity.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
We already have trace_svcrdma_decode_rseg(), which records each
ingress Read segment. Instead of reporting those again when they
are about to be posted as RDMA Reads, let's fire one tracepoint
before posting each type of chunk.
So we'll get:
nfsd-1998 [002] 321.666615: svcrdma_decode_rseg: cq.id=4 cid=42 segno=0 position=0 192@0x013ca9ebfae14000:0xb0010b05
nfsd-1998 [002] 321.666615: svcrdma_decode_rseg: cq.id=4 cid=42 segno=1 position=0 7688@0x013ca9ebf914e000:0xb0010a05
nfsd-1998 [002] 321.666615: svcrdma_decode_rseg: cq.id=4 cid=42 segno=2 position=0 28@0x013ca9ebfae15000:0xb0010905
nfsd-1998 [002] 321.666622: svcrdma_decode_rqst: cq.id=4 cid=42 xid=0x013ca9eb vers=1 credits=128 proc=RDMA_NOMSG hdrlen=100
nfsd-1998 [002] 321.666642: svcrdma_post_read_chunk: cq.id=3 cid=112 sqecount=3
kworker/2:1H-221 [002] 321.673949: svcrdma_wc_read: cq.id=3 cid=112 status=SUCCESS (0/0x0)
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor svc_rdma_send_reply_chunk() so that it Sends only the parts
of rq_res that do not contain a result payload.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor: svc_rdma_map_reply_msg() is restructured to DMA map only
the parts of rq_res that do not contain a result payload.
This change has been tested to confirm that it does not cause a
regression in the no Write chunk and single Write chunk cases.
Multiple Write chunks have not been tested.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
When counting the number of SGEs needed to construct a Send request,
do not count result payloads. And, when copying the Reply message
into the pull-up buffer, result payloads are not to be copied to the
Send buffer.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>