Commit Graph

1267 Commits

Author SHA1 Message Date
Chuck Lever
c4fd9f4525 svcrdma: Acquire the svcxprt_rdma pointer from the CQ context
Enable the removal of the svc_rdma_chunk_ctxt::cc_rdma field in a
subsequent patch.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-01-07 17:54:28 -05:00
Chuck Lever
5ef6c66676 svcrdma: Reduce size of struct svc_rdma_rw_ctxt
SG_CHUNK_SIZE is 128, making struct svc_rdma_rw_ctxt + the first
SGL array more than 4200 bytes in length, pushing the memory
allocation well into order 1.

Even so, the RDMA rw core doesn't seem to use more than max_send_sge
entries in that array (typically 32 or less), so that is all wasted
space.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-01-07 17:54:28 -05:00
Chuck Lever
2dd6e29a3e svcrdma: Update some svcrdma DMA-related tracepoints
A send/recv_ctxt already records transport-related information
in the cq.id, thus there is no need to record the IP addresses of
the transport endpoints.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-01-07 17:54:28 -05:00
Chuck Lever
848760a9e7 svcrdma: DMA error tracepoints should report completion IDs
Update the DMA error flow tracepoints to report the completion ID of
the failing context. This ties the wait/failure to a particular
operation or request, which is more useful than knowing only the
failing transport.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-01-07 17:54:28 -05:00
Chuck Lever
ad3656bd84 svcrdma: SQ error tracepoints should report completion IDs
Update the Send Queue's error flow tracepoints to report the
completion ID of the waiting or failing context. This ties the
wait/failure to a particular operation or request, which is a little
more useful than knowing only the transport that is about to close.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-01-07 17:54:27 -05:00
Chuck Lever
be2acb1048 rpcrdma: Introduce a simple cid tracepoint class
De-duplicate some code, making it easier to add new tracepoints that
report only a completion ID.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-01-07 17:54:27 -05:00
Chuck Lever
907e34a7d0 svcrdma: Add lockdep class keys for transport locks
Two svcrdma-related transport locks can become quite contended.
Collate their use and make them easy to find in /proc/lock_stat for
better observability.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-01-07 17:54:27 -05:00
Chuck Lever
bfb81535c2 svcrdma: Clean up locking
There's no need to protect llist_entry() with a spin lock.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-01-07 17:54:27 -05:00
Chuck Lever
f09c36c8df svcrdma: Add an async version of svc_rdma_write_info_free()
DMA unmapping can take quite some time, so it should not be handled
in a single-threaded completion handler. Defer releasing write_info
structs to the recently-added workqueue.

With this patch, DMA unmapping can be handled in parallel, and it
does not cause head-of-queue blocking of Write completions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-01-07 17:54:27 -05:00
Chuck Lever
ae225fe27b svcrdma: Add an async version of svc_rdma_send_ctxt_put()
DMA unmapping can take quite some time, so it should not be handled
in a single-threaded completion handler. Defer releasing send_ctxts
to the recently-added workqueue.

With this patch, DMA unmapping can be handled in parallel, and it
does not cause head-of-queue blocking of Send completions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-01-07 17:54:27 -05:00
Chuck Lever
9c7e1a0658 svcrdma: Add a utility workqueue to svcrdma
To handle work in the background, set up an UNBOUND workqueue for
svcrdma. Subsequent patches will make use of it.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-01-07 17:54:26 -05:00
Chuck Lever
877118c667 svcrdma: Pre-allocate svc_rdma_recv_ctxt objects
The original reason for allocating svc_rdma_recv_ctxt objects during
Receive completion was to ensure the objects were allocated on the
NUMA node closest to the underlying IB device.

Since commit c5d68d25bd ("svcrdma: Clean up allocation of
svc_rdma_recv_ctxt"), however, the device's favored node is
explicitly passed to the memory allocator.

To enable switching Receive completion to soft IRQ context, move
memory allocation out of completion handling, since it can be
costly, and it can sleep.

A limited number of objects is now allocated at "accept" time.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-01-07 17:54:26 -05:00
Chuck Lever
b541dd554b svcrdma: Eliminate allocation of recv_ctxt objects in backchannel
The svc_rdma_recv_ctxt free list uses a lockless list to avoid the
need for a spin lock in the fast path. llist_del_first(), which is
used by svc_rdma_recv_ctxt_get(), requires serialization, however,
when there are multiple list producers that are unserialized.

I mistakenly thought there was only one caller of
svc_rdma_recv_ctxt_get() (svc_rdma_refresh_recvs()), thus explicit
serialization would not be necessary. But there is another caller:
svc_rdma_bc_sendto(), and these two are not serialized against each
other. I haven't seen ill effects that I could directly ascribe to
a lack of serialization. It's just an observation based on code
audit.

When DMA-mapping before sending a Reply, the passed-in struct
svc_rdma_recv_ctxt is used only for its write and reply PCLs. These
are currently always empty in the backchannel case. So, instead of
passing a full svc_rdma_recv_ctxt object to
svc_rdma_map_reply_msg(), let's pass in just the Write and Reply
PCLs.

This change makes it unnecessary for the backchannel to acquire a
dummy svc_rdma_recv_ctxt object when sending an RPC Call. The need
for svc_rdma_recv_ctxt free list serialization is now completely
avoided.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-01-07 17:54:26 -05:00
Chuck Lever
197115ebf3 svcrdma: Drop connection after an RDMA Read error
When an RPC Call message cannot be pulled from the client, that
is a message loss, by definition. Close the connection to trigger
the client to resend.

Cc: <stable@vger.kernel.org>
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:40 -04:00
NeilBrown
15d39883ee SUNRPC: change the back-channel queue to lwq
This removes the need to store and update back-links in the list.
It also remove the need for the _bh version of spin_lock().

Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:08 -04:00
NeilBrown
063ab935a4 SUNRPC: integrate back-channel processing with svc_recv()
Using svc_recv() for (NFSv4.1) back-channel handling means we have just
one mechanism for waking threads.

Also change kthread_freezable_should_stop() in nfs4_callback_svc() to
kthread_should_stop() as used elsewhere.
kthread_freezable_should_stop() effectively adds a try_to_freeze() call,
and svc_recv() already contains that at an appropriate place.

Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-10-16 12:44:03 -04:00
Yue Haibing
f9597ba887 xprtrdma: Remove unused function declaration rpcrdma_bc_post_recv()
rpcrdma_bc_post_recv() is never implemented since introduction in
commit f531a5dbc4 ("xprtrdma: Pre-allocate backward rpc_rqst and send/receive buffers").

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-08-23 15:58:47 -04:00
Linus Torvalds
53663f4103 NFS client fixes for Linux 6.5
Highlights include:
 
 Stable fixes
  - NFS: Fix a use after free in nfs_direct_join_group()
 
 Bugfixes
  - NFS: Fix a sysfs server name memory leak
  - NFS: Fix a lock recovery hang in NFSv4.0
  - NFS: Fix page free in the error path for nfs42_proc_getxattr
  - NFS: Fix page free in the error path for __nfs4_get_acl_uncached
  - SUNRPC/rdma: Fix receive buffer dma-mapping after a server disconnect
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmTjgEEACgkQZwvnipYK
 APItFA//WzGcKbujlMXpiRdvUg6k6CfG/ikBRB1UwQEyZjK/tVZ96qt6UuHGNMbz
 b8GaGls7NRYJKezAcMSW9QMMPYVyG0PLwxOW6BPwsZS61Zn6HMeM1YRboaZEid7f
 JrUNhbUXHl6bVWrBNEtcr3IN/5ERU4sGCAa4A3uWdNxGyffD/avrK06/bfmE/SJi
 +7LVPp0M9rM5X5Z1c407TbWfg+L81Q9t0tTz7II3Ba9i2BzQ0uhQhyVUQAGF767u
 Vua4XWTRoqG1es+tA4iuwZ3KtaqXoaMRDWPLGTkmBrY+pAo+u4IPzY5LCwfUu6kI
 vttkZU5b0b05+UomJ1d+Muzr8uEjRmBhIHZsP6lgVVmuNzqkDb0gCGkfix87J+RO
 0QmDZ9D0ftJxsb8fSdp8iy8NqmqJ6X4FhsylRtANEuCrf8+zrkUlBJi47CCwpYDD
 8gq6SoTfA8MmiSgzrBuYkJe2HSx7c2csDl3xp5KrJX2IHODjbzlHC05fNadTWc6W
 0jQvq1cJ2xBYDNSxkG0Trsd3lTTao3rZC4M7imVVjTTOHS8X1LNCLkbZ7LVnA8rn
 0F+lp/h1qs/daXSp0aMG5wyvZNkx5rsJ23o+InNCjiCh3cDvoi9mg6DN5bQK8Foy
 Iqd2MTgxrMaF/FUbdGLdnFX4GQkgFPng8TpdX8sqqm1JHUprpqg=
 =nd41
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-6.5-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client fixes from Trond Myklebust:

 - fix a use after free in nfs_direct_join_group() (Cc: stable)

 - fix sysfs server name memory leak

 - fix lock recovery hang in NFSv4.0

 - fix page free in the error path for nfs42_proc_getxattr() and
   __nfs4_get_acl_uncached()

 - SUNRPC/rdma: fix receive buffer dma-mapping after a server disconnect

* tag 'nfs-for-6.5-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  xprtrdma: Remap Receive buffers after a reconnect
  NFSv4: fix out path in __nfs4_get_acl_uncached
  NFSv4.2: fix error handling in nfs42_proc_getxattr
  NFS: Fix sysfs server name memory leak
  NFS: Fix a use after free in nfs_direct_join_group()
  NFSv4: Fix dropped lock for racing OPEN and delegation return
2023-08-22 10:50:17 -07:00
Chuck Lever
895cedc179 xprtrdma: Remap Receive buffers after a reconnect
On server-initiated disconnect, rpcrdma_xprt_disconnect() was DMA-
unmapping the Receive buffers, but rpcrdma_post_recvs() neglected
to remap them after a new connection had been established. The
result was immediate failure of the new connection with the Receives
flushing with LOCAL_PROT_ERR.

Fixes: 671c450b6f ("xprtrdma: Fix oops in Receive handler after device removal")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-08-19 10:26:29 -04:00
Chuck Lever
88770b8de3 svcrdma: Fix stale comment
Commit 7d81ee8722 ("svcrdma: Single-stage RDMA Read") changed the
behavior of svc_rdma_recvfrom() but neglected to update the
documenting comment.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-18 12:09:08 -04:00
Chuck Lever
b55c63332e svcrdma: Remove an unused argument from __svc_rdma_put_rw_ctxt()
Clean up.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-17 13:18:07 -04:00
Chuck Lever
a23c76e92d svcrdma: trace cc_release calls
This event brackets the svcrdma_post_* trace points. If this trace
event is enabled but does not appear as expected, that indicates a
chunk_ctxt leak.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-17 13:18:06 -04:00
Chuck Lever
91f8ce2846 svcrdma: Convert "might sleep" comment into a code annotation
Try to catch incorrect calling contexts mechanically rather than by
code review.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-17 13:18:06 -04:00
Chuck Lever
5581cf8efc SUNRPC: Optimize page release in svc_rdma_sendto()
Now that we have bulk page allocation and release APIs, it's more
efficient to use those than it is for nfsd threads to wait for send
completions. Previous patches have eliminated the calls to
wait_for_completion() and complete(), in order to avoid scheduler
overhead.

Now release pages-under-I/O in the send completion handler using
the efficient bulk release API.

I've measured a 7% reduction in cumulative CPU utilization in
svc_rdma_sendto(), svc_rdma_wc_send(), and svc_xprt_release(). In
particular, using release_pages() instead of complete() cuts the
time per svc_rdma_wc_send() call by two-thirds. This helps improve
scalability because svc_rdma_wc_send() is single-threaded per
connection.

Reviewed-by: Tom Talpey <tom@talpey.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-17 13:18:06 -04:00
Chuck Lever
baf6d18b11 svcrdma: Prevent page release when nothing was received
I noticed that svc_rqst_release_pages() was still unnecessarily
releasing a page when svc_rdma_recvfrom() returns zero.

Fixes: a53d5cb064 ("svcrdma: Avoid releasing a page in svc_xprt_release()")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-17 13:18:04 -04:00
Chuck Lever
c4b50cdf9d svcrdma: Revert 2a1e4f21d8 ("svcrdma: Normalize Send page handling")
Get rid of the completion wait in svc_rdma_sendto(), and release
pages in the send completion handler again. A subsequent patch will
handle releasing those pages more efficiently.

Reverted by hand: patch -R would not apply cleanly.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-12 12:16:36 -04:00
Chuck Lever
a944209c11 SUNRPC: Revert 579900670a ("svcrdma: Remove unused sc_pages field")
Pre-requisite for releasing pages in the send completion handler.
Reverted by hand: patch -R would not apply cleanly.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-12 12:16:36 -04:00
Chuck Lever
6be7afcd92 SUNRPC: Revert cc93ce9529 ("svcrdma: Retain the page backing rq_res.head[0].iov_base")
Pre-requisite for releasing pages in the send completion handler.
Reverted by hand: patch -R would not apply cleanly.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-12 12:16:35 -04:00
Chuck Lever
ac3c32bbdb svcrdma: Clean up allocation of svc_rdma_rw_ctxt
The physical device's favored NUMA node ID is available when
allocating a rw_ctxt. Use that value instead of relying on the
assumption that the memory allocation happens to be running on a
node close to the device.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-12 12:16:35 -04:00
Chuck Lever
ed51b42610 svcrdma: Clean up allocation of svc_rdma_send_ctxt
The physical device's favored NUMA node ID is available when
allocating a send_ctxt. Use that value instead of relying on the
assumption that the memory allocation happens to be running on a
node close to the device.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-12 12:16:35 -04:00
Chuck Lever
c5d68d25bd svcrdma: Clean up allocation of svc_rdma_recv_ctxt
The physical device's favored NUMA node ID is available when
allocating a recv_ctxt. Use that value instead of relying on the
assumption that the memory allocation happens to be running on a
node close to the device.

This clean up eliminates the hack of destroying recv_ctxts that
were not created by the receive CQ thread -- recv_ctxts are now
always allocated on a "good" node.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-12 12:16:35 -04:00
Chuck Lever
fe2b401e55 svcrdma: Allocate new transports on device's NUMA node
The physical device's NUMA node ID is available when allocating an
svc_xprt for an incoming connection. Use that value to ensure the
svc_xprt structure is allocated on the NUMA node closest to the
device.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-12 12:16:34 -04:00
Linus Torvalds
1b66c114d1 nfsd-6.4 fixes:
- A collection of minor bug fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmRiQVEACgkQM2qzM29m
 f5eYfBAAg5Qz45PL+fo1qWxkJ1ZKaNV1vPdi4tqCt9NEItDTjAnjj0am+rKNGZAz
 EOM2yFt4xaZGyMYgXe4VnYl0N+rSpbI/H+Rk/wOq4OHPURQD5EO9VeP86qZ7rmGl
 ECPqb39TFTwAiRomC/DHO4eNpoe1rQuXu0tW9+GmqDGxeuh8xdxTk33g17ZXwCFN
 tdPkkPjxVPdWd8X7HQg9kWm8AWfV+GyuzE2rKAoOjbs6Wv6d9GCY8Cb5HXkRsQhF
 4Zh0PVQuTuXurZwtPXwnS0k4kfvQwjlTIKHlXuo0ZLh+SuFbrWHzv0fVyD+kUpSK
 HtWbJ8JcruUvz0WGMtZatzRLHCZLDguV6oVXPp7rtmuxTj4szzHSFpEeAV901sIm
 Nkvuomvd02K/fiTo7s3yr6t1VG2vju9LDwhBe197iA3leHAlockfbbxE3NJMGbzQ
 NoOPd+lu95cfsanOM1LZZLNfbLrZofoSLK9K1+HD0yAVdyq7u47FyHRrymvCaMrj
 GiheuqrBfBMEq+2mCwUn37aM0FblYEXQM0xTVXPQcHtBBN/nGZxPJukmpr7ScNlR
 aqMtDoOLu4OEFuo6fe2/94eNi+N5XAZgWmx/mSyaytE8Xw9LJxeQ83UTigaGcYKc
 3YIuG1YXg9IyIoIdLkghB+Aj/6fivsGFK9Gud6g7I3xw4f15noA=
 =PiNG
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - A collection of minor bug fixes

* tag 'nfsd-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  NFSD: Remove open coding of string copy
  SUNRPC: Fix trace_svc_register() call site
  SUNRPC: always free ctxt when freeing deferred request
  SUNRPC: double free xprt_ctxt while still in use
  SUNRPC: Fix error handling in svc_setup_socket()
  SUNRPC: Fix encoding of accepted but unsuccessful RPC replies
  lockd: define nlm_port_min,max with CONFIG_SYSCTL
  nfsd: define exports_proc_ops with CONFIG_PROC_FS
  SUNRPC: Avoid relying on crypto API to derive CBC-CTS output IV
2023-05-17 09:56:01 -07:00
NeilBrown
948f072ada SUNRPC: always free ctxt when freeing deferred request
Since the ->xprt_ctxt pointer was added to svc_deferred_req, it has not
been sufficient to use kfree() to free a deferred request.  We may need
to free the ctxt as well.

As freeing the ctxt is all that ->xpo_release_rqst() does, we repurpose
it to explicit do that even when the ctxt is not stored in an rqst.
So we now have ->xpo_release_ctxt() which is given an xprt and a ctxt,
which may have been taken either from an rqst or from a dreq.  The
caller is now responsible for clearing that pointer after the call to
->xpo_release_ctxt.

We also clear dr->xprt_ctxt when the ctxt is moved into a new rqst when
revisiting a deferred request.  This ensures there is only one pointer
to the ctxt, so the risk of double freeing in future is reduced.  The
new code in svc_xprt_release which releases both the ctxt and any
rq_deferred depends on this.

Fixes: 773f91b2cf ("SUNRPC: Fix NFSD's request deferral on RDMA transports")
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-05-14 15:55:02 -04:00
Linus Torvalds
4e1c80ae5c NFSD 6.4 Release Notes
The big ticket item for this release is support for RPC-with-TLS
 [RFC 9289] has been added to the Linux NFS server. The goal is to
 provide a simple-to-deploy, low-overhead in-transit confidentiality
 and peer authentication mechanism. It can supplement NFS Kerberos
 and it can protect the use of legacy non-cryptographic user
 authentication flavors such as AUTH_SYS. The TLS Record protocol is
 handled entirely by kTLS, meaning it can use either software
 encryption or offload encryption to smart NICs.
 
 Work continues on improving NFSD's open file cache. Among the many
 clean-ups in that area is a patch to convert the rhashtable to use
 the list-hashing version of that data structure.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmRK/JMACgkQM2qzM29m
 f5cF5A/+JZFRSPlfSYt0YHzUQQSDdYn5n/IG9TwJQd62xheu083WuKRaCOYYoOhg
 06nZd6p7nuF1E0n2ZWOKSE6YkBSE6z4M6KrQlm6lCe/nmxYCR87IYfJCXuL+Yf0e
 /LdL4OTvDHzY5ec1DreERldPIUJ8GFzwChH8/z4XwbNDR7qJkF/gf8YxpFr+8K+j
 Cfyl8woZiEze/Nvxy1YtAqa7HMEpitt0aWJN55rHwTh9c3b0nmDzziYFcVqXgybJ
 /qUHfHBak66ll8RqhcQ3BMuyfszwASERbPsaZ2a5F/RaxLL5ZWfFyhgQwm+PZWT+
 J5DdSBwLEQYtKQGD41A1aorP6v/u2QelfWrl4S7/qjRpREp8Ba2IU4fYLjGb1499
 Imk68BA7NwFp87tdMi/7en1VVgina4U/S3X71aUYWe+C0g48BfTrVwq4SVbQSAo4
 1638vbZnrJbsJMr9OaaysKWfv4KZB020Ji1KVwuqmgy5F8kdfJCCQ2UR/fHuJ3DY
 R0Zrd1Ryjwr83viP+Xj0ERiW405gPdCT0RJqoA7rznRPCqT5M42tf5z65uO7iZeE
 C1udgDaoQOtioKlem6FcDXLkryf986slGA7V91lat/Jt8A5jLKQfjVe3Q+kaaqXP
 ka1DQnYelzMzILQQs39cqW5pShrH8e3tfRZ7JhdBgrpxVXz9ZZM=
 =lA2+
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd updates from Chuck Lever:
 "The big ticket item for this release is that support for RPC-with-TLS
  [RFC 9289] has been added to the Linux NFS server.

  The goal is to provide a simple-to-deploy, low-overhead in-transit
  confidentiality and peer authentication mechanism. It can supplement
  NFS Kerberos and it can protect the use of legacy non-cryptographic
  user authentication flavors such as AUTH_SYS. The TLS Record protocol
  is handled entirely by kTLS, meaning it can use either software
  encryption or offload encryption to smart NICs.

  Aside from that, work continues on improving NFSD's open file cache.
  Among the many clean-ups in that area is a patch to convert the
  rhashtable to use the list-hashing version of that data structure"

* tag 'nfsd-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (31 commits)
  NFSD: Handle new xprtsec= export option
  SUNRPC: Support TLS handshake in the server-side TCP socket code
  NFSD: Clean up xattr memory allocation flags
  NFSD: Fix problem of COMMIT and NFS4ERR_DELAY in infinite loop
  SUNRPC: Clear rq_xid when receiving a new RPC Call
  SUNRPC: Recognize control messages in server-side TCP socket code
  SUNRPC: Be even lazier about releasing pages
  SUNRPC: Convert svc_xprt_release() to the release_pages() API
  SUNRPC: Relocate svc_free_res_pages()
  nfsd: simplify the delayed disposal list code
  SUNRPC: Ignore return value of ->xpo_sendto
  SUNRPC: Ensure server-side sockets have a sock->file
  NFSD: Watch for rq_pages bounds checking errors in nfsd_splice_actor()
  sunrpc: simplify two-level sysctl registration for svcrdma_parm_table
  SUNRPC: return proper error from get_expiry()
  lockd: add some client-side tracepoints
  nfs: move nfs_fhandle_hash to common include file
  lockd: server should unlock lock if client rejects the grant
  lockd: fix races in client GRANTED_MSG wait logic
  lockd: move struct nlm_wait to lockd.h
  ...
2023-04-29 11:04:14 -07:00
Luis Chamberlain
376bcd9b37 sunrpc: simplify two-level sysctl registration for svcrdma_parm_table
There is no need to declare two tables to just create directories,
this can be easily be done with a prefix path with register_sysctl().

Simplify this registration.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:05:01 -04:00
Luis Chamberlain
17c6d0ce83 sunrpc: simplify one-level sysctl registration for xr_tunables_table
There is no need to declare an extra tables to just create directory,
this can be easily be done with a prefix path with register_sysctl().

Simplify this registration.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-04-11 12:44:49 -04:00
Chuck Lever
319951eba0 SUNRPC: Remove ->xpo_secure_port()
There's no need for the cost of this extra virtual function call
during every RPC transaction: the RQ_SECURE bit can be set properly
in ->xpo_recvfrom() instead.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-02-20 09:20:55 -05:00
Linus Torvalds
7dd4b804e0 nfsd-6.2 fixes:
- Fix a race when creating NFSv4 files
 - Revert the use of relaxed bitops
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmO9twgACgkQM2qzM29m
 f5cQ/A//SSv/eZl2cnAMZtN1zd7wIMfI6E9y8Ccv49aebUXGmGDKSwz/CUns2sgO
 avWentInUYg2cexIjaQnQeGkiQt0Do+3u/cdT86h2e8q3UhvctYWO5uRCqbP+36H
 JRLfNUUbic4P8Yp/LZ5DvwOWae4PLdZq71mxJkaTXGHt8zLn/yEntCY8jb6V7D2L
 SxMXAoO05bdzfPc8lXKmaGi4JMsANEOMh5ZMRpKxKTEFQG352db17MqwOAW/Qe+t
 mMXY2jRfeufFwimmwLK06EzItgcs6D9g7dM3oIwDUNiPL4l3lOYeynbYOref7fD3
 4u11LwZdzZ5LYIZ0HoTpRu3ZxAbrTtmd1FiT7SwN9jjq1vu0Zx0sfqk0R9VixY3c
 jP+wYKEDTQUkIVdbG6g/u6yQZvwM281+GiAXoD3FJWKJDwAaqwxd6cphCn314RKY
 hlgG4DGhAi0BYbiLVu5ObQwRb1yPgCP2pXqguAdAKbTM2DVC2+hAW3NDUcIKrR1U
 JoXmGBaWeuJU9/0JbfVzddXUCs227hnovj1nmGW7E8JUegW4m+3oscEP8tsC5H5S
 J3Jr9ovxyYGQE1qxM5909hjPjrZxI3NszKIpgWoo9/jJLUWfGtnS2BclrXUxQrdl
 rvbKHvmSLyOsFYnZ5Nt7uj1l7LtWMljrjOjPqe02iU6pRDNHa9Y=
 =/7AX
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Fix a race when creating NFSv4 files

 - Revert the use of relaxed bitops

* tag 'nfsd-6.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  NFSD: Use set_bit(RQ_DROPME)
  Revert "SUNRPC: Use RMW bitops in single-threaded hot paths"
  nfsd: fix handling of cached open files in nfsd4_open codepath
2023-01-10 15:03:06 -06:00
Chuck Lever
7827c81f02 Revert "SUNRPC: Use RMW bitops in single-threaded hot paths"
The premise that "Once an svc thread is scheduled and executing an
RPC, no other processes will touch svc_rqst::rq_flags" is false.
svc_xprt_enqueue() examines the RQ_BUSY flag in scheduled nfsd
threads when determining which thread to wake up next.

Found via KCSAN.

Fixes: 28df098881 ("SUNRPC: Use RMW bitops in single-threaded hot paths")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-01-06 13:17:12 -05:00
Zhang Xiaoxu
9181f40fb2 xprtrdma: Fix regbuf data not freed in rpcrdma_req_create()
If rdma receive buffer allocate failed, should call rpcrdma_regbuf_free()
to free the send buffer, otherwise, the buffer data will be leaked.

Fixes: bb93a1ae2b ("xprtrdma: Allocate req's regbufs at xprt create time")
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-12-06 12:19:48 -05:00
Chuck Lever
e4266f23ec xprtrdma: Fix uninitialized variable
net/sunrpc/xprtrdma/frwr_ops.c:151:32: warning: variable 'rc' is uninitialized when used here [-Wuninitialized]
          trace_xprtrdma_frwr_alloc(mr, rc);
                                        ^~
  net/sunrpc/xprtrdma/frwr_ops.c:127:8: note: initialize the variable 'rc' to silence this warning
          int rc;
                ^
                 = 0
  1 warning generated.

The tracepoint is intended to record the error returned from
ib_alloc_mr(). In the current code there is no other purpose for
@rc, so simply replace it.

Reported-by: kernel test robot <lkp@intel.com>
Fixes: d8cf39a280c3b0 ('xprtrdma: MR-related memory allocation should be allowed to fail')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-10-05 15:47:17 -04:00
Chuck Lever
f20f18c956 xprtrdma: Prevent memory allocations from driving a reclaim
Many memory allocations that xprtrdma does can fail safely. Let's
use this fact to avoid some potential deadlocks: Replace GFP_KERNEL
with GFP flags that do not try hard to acquire memory.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-10-05 15:47:17 -04:00
Chuck Lever
9c8f332fbf xprtrdma: Memory allocation should be allowed to fail during connect
An attempt to establish a connection can always fail and then be
retried. GFP_KERNEL allocation is not necessary here.

Like MR allocation, establishing a connection is always done in a
worker thread. The new GFP flags align with the flags that would
be returned by rpc_task_gfp_mask() in this case.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-10-05 15:47:16 -04:00
Chuck Lever
2d77058cce xprtrdma: MR-related memory allocation should be allowed to fail
xprtrdma always drives a retry of MR allocation if it should fail.
It should be safe to not use GFP_KERNEL for this purpose rather
than sleeping in the memory allocator.

In theory, if these weaker allocations are attempted first, memory
exhaustion is likely to cause xprtrdma to fail fast and not then
invoke the RDMA core APIs, which still might use GFP_KERNEL.

Also note that rpc_task_gfp_mask() always sets __GFP_NORETRY and
__GFP_NOWARN when an RPC-related allocation is being done in a
worker thread. MR allocation is already always done in worker
threads.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-10-05 15:47:16 -04:00
Chuck Lever
7ac1879875 xprtrdma: Clean up synopsis of rpcrdma_regbuf_alloc()
Currently all rpcrdma_regbuf_alloc() call sites pass the same value
as their third argument. That argument can therefore be eliminated.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-10-05 15:47:16 -04:00
Chuck Lever
3b50cc1c7f xprtrdma: Clean up synopsis of rpcrdma_req_create()
Commit 1769e6a816 ("xprtrdma: Clean up rpcrdma_create_req()")
added rpcrdma_req_create() with a GFP flags argument in case a
caller might want to avoid waiting for memory.

There has never been a caller that does not pass GFP_KERNEL as
the third argument. That argument can therefore be eliminated.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-10-05 15:47:16 -04:00
Chuck Lever
5014831264 svcrdma: Clean up RPCRDMA_DEF_GFP
xprt_rdma_bc_allocate() is now the only user of RPCRDMA_DEF_GFP.
Replace that macro with the raw flags.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-10-05 15:47:16 -04:00
Chuck Lever
6b1eb3b222 SUNRPC: Replace the use of the xprtiod WQ in rpcrdma
While setting up a new lab, I accidentally misconfigured the
Ethernet port for a system that tried an NFS mount using RoCE.
This made the NFS server unreachable. The following WARNING
popped on the NFS client while waiting for the mount attempt to
time out:

kernel: workqueue: WQ_MEM_RECLAIM xprtiod:xprt_rdma_connect_worker [rpcrdma] is flushing !WQ_MEM_RECLAI>
kernel: WARNING: CPU: 0 PID: 100 at kernel/workqueue.c:2628 check_flush_dependency+0xbf/0xca
kernel: Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs 8021q garp stp mrp llc rfkill rpcrdma>
kernel: CPU: 0 PID: 100 Comm: kworker/u8:8 Not tainted 6.0.0-rc1-00002-g6229f8c054e5 #13
kernel: Hardware name: Supermicro X10SRA-F/X10SRA-F, BIOS 2.0b 06/12/2017
kernel: Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma]
kernel: RIP: 0010:check_flush_dependency+0xbf/0xca
kernel: Code: 75 2a 48 8b 55 18 48 8d 8b b0 00 00 00 4d 89 e0 48 81 c6 b0 00 00 00 48 c7 c7 65 33 2e be>
kernel: RSP: 0018:ffffb562806cfcf8 EFLAGS: 00010092
kernel: RAX: 0000000000000082 RBX: ffff97894f8c3c00 RCX: 0000000000000027
kernel: RDX: 0000000000000002 RSI: ffffffffbe3447d1 RDI: 00000000ffffffff
kernel: RBP: ffff978941315840 R08: 0000000000000000 R09: 0000000000000000
kernel: R10: 00000000000008b0 R11: 0000000000000001 R12: ffffffffc0ce3731
kernel: R13: ffff978950c00500 R14: ffff97894341f0c0 R15: ffff978951112eb0
kernel: FS:  0000000000000000(0000) GS:ffff97987fc00000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007f807535eae8 CR3: 000000010b8e4002 CR4: 00000000003706f0
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: Call Trace:
kernel:  <TASK>
kernel:  __flush_work.isra.0+0xaf/0x188
kernel:  ? _raw_spin_lock_irqsave+0x2c/0x37
kernel:  ? lock_timer_base+0x38/0x5f
kernel:  __cancel_work_timer+0xea/0x13d
kernel:  ? preempt_latency_start+0x2b/0x46
kernel:  rdma_addr_cancel+0x70/0x81 [ib_core]
kernel:  _destroy_id+0x1a/0x246 [rdma_cm]
kernel:  rpcrdma_xprt_connect+0x115/0x5ae [rpcrdma]
kernel:  ? _raw_spin_unlock+0x14/0x29
kernel:  ? raw_spin_rq_unlock_irq+0x5/0x10
kernel:  ? finish_task_switch.isra.0+0x171/0x249
kernel:  xprt_rdma_connect_worker+0x3b/0xc7 [rpcrdma]
kernel:  process_one_work+0x1d8/0x2d4
kernel:  worker_thread+0x18b/0x24f
kernel:  ? rescuer_thread+0x280/0x280
kernel:  kthread+0xf4/0xfc
kernel:  ? kthread_complete_and_exit+0x1b/0x1b
kernel:  ret_from_fork+0x22/0x30
kernel:  </TASK>

SUNRPC's xprtiod workqueue is WQ_MEM_RECLAIM, so any workqueue that
one of its work items tries to cancel has to be WQ_MEM_RECLAIM to
prevent a priority inversion. The internal workqueues in the
RDMA/core are currently non-MEM_RECLAIM.

Jason Gunthorpe says this about the current state of RDMA/core:
> If you attempt to do a reconnection/etc from within a RECLAIM
> context it will deadlock on one of the many allocations that are
> made to support opening the connection.
>
> The general idea of reclaim is that the entire task context
> working under the reclaim is marked with an override of the gfp
> flags to make all allocations under that call chain reclaim safe.
>
> But rdmacm does allocations outside this, eg in the WQs processing
> the CM packets. So this doesn't work and we will deadlock.
>
> Fixing it is a big deal and needs more than poking WQ_MEM_RECLAIM
> here and there.

So we will change the ULP in this case to avoid the use of
WQ_MEM_RECLAIM where possible. Deadlocks that were possible before
are not fixed, but at least we no longer have a false sense of
confidence that the stack won't allocate memory during memory
reclaim.

Suggested-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-10-05 15:47:16 -04:00
Trond Myklebust
4b8dbdfbc5 SUNRPC: Fix an RPC/RDMA performance regression
Use the standard gfp mask instead of using GFP_NOWAIT. The latter causes
issues when under memory pressure.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-07-10 19:00:53 -04:00