13945 Commits

Author SHA1 Message Date
Bob Pearson
86a3fb55bc RDMA/rxe: Let rkey == lkey for local access
In order to conform to other drivers stop using rkey == 0 as an indication
that there are no remote access flags set.  Set rkey == lkey by default
for all MRs.

Link: https://lore.kernel.org/r/20230530221334.89432-6-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-09 13:18:53 -03:00
Bob Pearson
02ed253770 RDMA/rxe: Introduce rxe access supported flags
Introduce supported bit masks for setting the access attributes of MWs,
MRs, and QPs. Check these when attributes are set.

Link: https://lore.kernel.org/r/20230530221334.89432-5-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-09 13:18:53 -03:00
Bob Pearson
425e1c9018 RDMA/rxe: Fix access checks in rxe_check_bind_mw
The subroutine rxe_check_bind_mw() in rxe_mw.c performs checks on the mw
access flags before they are set so they always succeed.  This patch
instead checks the access flags passed in the send wqe.

Fixes: 32a577b4c3a9 ("RDMA/rxe: Add support for bind MW work requests")
Link: https://lore.kernel.org/r/20230530221334.89432-4-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-09 13:18:52 -03:00
Bob Pearson
2a129958bd RDMA//rxe: Optimize send path in rxe_resp.c
Bypass calling check_rkey() in rxe_resp.c for non-rdma messages.

Link: https://lore.kernel.org/r/20230530221334.89432-3-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-09 13:18:52 -03:00
Bob Pearson
d11442c6bd RDMA/rxe: Rename IB_ACCESS_REMOTE
Rename IB_ACCESS_REMOTE to RXE_ACCESS_REMOTE and move to rxe_verbs.h as an
enum instead of a #define. Shouldn't use IB_xxx for rxe symbols.

Link: https://lore.kernel.org/r/20230530221334.89432-2-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-09 13:18:52 -03:00
Shay Drory
222dd18583 {net/RDMA}/mlx5: introduce lag_for_each_peer
Introduce a generic APIs to iterate over all the devices which are part
of the LAG. This API replace mlx5_lag_get_peer_mdev() which retrieve
only a single peer device from the lag.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-06-07 14:00:42 -07:00
Shay Drory
962825e534 RDMA/mlx5: Free second uplink ib port
The cited patch introduce ib port for the slave device uplink in
case of multiport eswitch. However, this ib port didn't perform
anything when unloaded.
Unload the new ib port properly.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-06-07 14:00:42 -07:00
Kamal Heib
18e7e3e421 RDMA/bnxt_re: Fix reporting active_{speed,width} attributes
After commit 6d758147c7b8 ("RDMA/bnxt_re: Use auxiliary driver interface")
the active_{speed, width} attributes are reported incorrectly, This is
happening because ib_get_eth_speed() is called only once from
bnxt_re_ib_init() - Fix this issue by calling ib_get_eth_speed() from
bnxt_re_query_port().

Fixes: 6d758147c7b8 ("RDMA/bnxt_re: Use auxiliary driver interface")
Link: https://lore.kernel.org/r/20230529153525.87254-1-kheib@redhat.com
Signed-off-by: Kamal Heib <kheib@redhat.com>
Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-01 20:01:29 -03:00
Chengchang Tang
a519a612a7 RDMA/hns: Add clear_hem return value to log
Log return value of clear_hem() to help diagnose.

Link: https://lore.kernel.org/r/20230523121641.3132102-4-huangjunxian6@hisilicon.com
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-01 19:58:47 -03:00
Chengchang Tang
cf5b608fb0 RDMA/hns: Fix hns_roce_table_get return value
The return value of set_hem has been fixed to ENODEV, which will lead a
diagnostic information missing.

Fixes: 9a4435375cd1 ("IB/hns: Add driver files for hns RoCE driver")
Link: https://lore.kernel.org/r/20230523121641.3132102-3-huangjunxian6@hisilicon.com
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-01 19:58:47 -03:00
Junxian Huang
b9989ab3f6 RDMA/hns: Remove unnecessary QP type checks
It is not necessary to check the type of the queue on IO path because
unsupported QP type cannot be created.

Link: https://lore.kernel.org/r/20230523121641.3132102-2-huangjunxian6@hisilicon.com
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-01 19:58:47 -03:00
Jakub Kicinski
a03a91bd68 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts.

Adjacent changes:

drivers/net/ethernet/sfc/tc.c
  622ab656344a ("sfc: fix error unwinds in TC offload")
  b6583d5e9e94 ("sfc: support TC decap rules matching on enc_src_port")

net/mptcp/protocol.c
  5b825727d087 ("mptcp: add annotations around msk->subflow accesses")
  e76c8ef5cc5b ("mptcp: refactor mptcp_stream_accept()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-01 15:38:26 -07:00
Brendan Cunningham
95ea2efbd6 IB/hfi1: Remove unused struct mmu_rb_ops fields .insert, .invalidate
The struct mmu_rb_ops function pointers .insert, .invalidate were only
used to increment and decrement struct sdma_mmu_node.refcount.

With the deletion of struct sdma_mmu_node.refcount and the addition of
struct mmu_rb_node.refcount  these function pointers are not called and
there are no implementations of them. So it is safe to delete these from
struct mmu_rb_ops.

Link: https://lore.kernel.org/r/168451526508.3702129.8677714753157495310.stgit@awfm-02.cornelisnetworks.com
Signed-off-by: Brendan Cunningham <bcunningham@cornelisnetworks.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-01 14:45:58 -03:00
Brendan Cunningham
e236e2eae5 IB/hfi1: Add mmu_rb_node refcount to hfi1_mmu_rb_template tracepoints
Add kref_read() of mmu_rb_node.refcount in hfi1_mmu_rb_template-type
tracepoint output.

Change hfi1_mmu_rb_template tracepoint to take a struct mmu_rb_node* and
record the values it needs from that. This makes the trace_hfi1_mmu*()
calls shorter and easier to read.

Add hfi1_mmu_release_node() tracepoint before all
mmu_rb_node->handler->ops->remove() calls.

Make hfi1_mmu_rb_search() tracepoint its own tracepoint type separate
from hfi1_mmu_rb_template since hfi1_mmu_rb_search() does not take a
struct mmu_rb_node*.

Link: https://lore.kernel.org/r/168451525987.3702129.12824880387615916700.stgit@awfm-02.cornelisnetworks.com
Reviewed-by: Dean Luick <dean.luick@cornelisnetworks.com>
Signed-off-by: Brendan Cunningham <bcunningham@cornelisnetworks.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-01 14:45:57 -03:00
Brendan Cunningham
c9358de193 IB/hfi1: Fix wrong mmu_node used for user SDMA packet after invalidate
The hfi1 user SDMA pinned-page cache will leave a stale cache entry when
the cache-entry's virtual address range is invalidated but that cache
entry is in-use by an outstanding SDMA request.

Subsequent user SDMA requests with buffers in or spanning the virtual
address range of the stale cache entry will result in packets constructed
from the wrong memory, the physical pages pointed to by the stale cache
entry.

To fix this, remove mmu_rb_node cache entries from the mmu_rb_handler
cache independent of the cache entry's refcount. Add 'struct kref
refcount' to struct mmu_rb_node and manage mmu_rb_node lifetime with
kref_get() and kref_put().

mmu_rb_node.refcount makes sdma_mmu_node.refcount redundant. Remove
'atomic_t refcount' from struct sdma_mmu_node and change sdma_mmu_node
code to use mmu_rb_node.refcount.

Move the mmu_rb_handler destructor call after a
wait-for-SDMA-request-completion call so mmu_rb_nodes that need
mmu_rb_handler's workqueue to queue themselves up for destruction from an
interrupt context may do so.

Fixes: f48ad614c100 ("IB/hfi1: Move driver out of staging")
Fixes: 00cbce5cbf88 ("IB/hfi1: Fix bugs with non-PAGE_SIZE-end multi-iovec user SDMA requests")
Link: https://lore.kernel.org/r/168451393605.3700681.13493776139032178861.stgit@awfm-02.cornelisnetworks.com
Reviewed-by: Dean Luick <dean.luick@cornelisnetworks.com>
Signed-off-by: Brendan Cunningham <bcunningham@cornelisnetworks.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-01 14:38:00 -03:00
Bob Pearson
b00683422f RDMA/rxe: Fix ref count error in check_rkey()
There is a reference count error in error path code and a potential race
in check_rkey() in rxe_resp.c. When looking up the rkey for a memory
window the reference to the mw from rxe_lookup_mw() is dropped before a
reference is taken on the mr referenced by the mw. If the mr is destroyed
immediately after the call to rxe_put(mw) the mr pointer is unprotected
and may end up pointing at freed memory. The rxe_get(mr) call should take
place before the rxe_put(mw) call.

All errors in check_rkey() call rxe_put(mw) if mw is not NULL but it was
already called after the above. The mw pointer should be set to NULL after
the rxe_put(mw) call to prevent this from happening.

Fixes: cdd0b85675ae ("RDMA/rxe: Implement memory access through MWs")
Link: https://lore.kernel.org/r/20230517211509.1819998-1-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-01 14:27:36 -03:00
Bob Pearson
9a3763e873 RDMA/rxe: Fix packet length checks
In rxe_net.c a received packet, from udp or loopback, is passed to
rxe_rcv() in rxe_recv.c as a udp packet. I.e. skb->data is pointing at the
udp header. But rxe_rcv() makes length checks to verify the packet is long
enough to hold the roce headers as if it were a roce
packet. I.e. skb->data pointing at the bth header. A runt packet would
appear to have 8 more bytes than it actually does which may lead to
incorrect behavior.

This patch calls skb_pull() to adjust the skb to point at the bth header
before calling rxe_rcv() which fixes this error.

Fixes: 8700e3e7c485 ("Soft RoCE driver")
Link: https://lore.kernel.org/r/20230517172242.1806340-1-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-01 14:27:25 -03:00
Li Zhijian
9c29c8c7df RDMA/rtrs: Fix rxe_dealloc_pd warning
In current design:
1. PD and clt_path->s.dev are shared among connections.
2. every con[n]'s cleanup phase will call destroy_con_cq_qp()
3. clt_path->s.dev will be always decreased in destroy_con_cq_qp(), and
   when clt_path->s.dev become zero, it will destroy PD.
4. when con[1] failed to create, con[1] will not take clt_path->s.dev,
   but it try to decreased clt_path->s.dev

So, in case create_cm(con[0]) succeeds but create_cm(con[1]) fails,
destroy_con_cq_qp(con[1]) will be called first which will destroy the PD
while this PD is still taken by con[0].

Here, we refactor the error path of create_cm() and init_conns(), so that
we do the cleanup in the order they are created.

The warning occurs when destroying RXE PD whose reference count is not
zero.

 rnbd_client L597: Mapping device /dev/nvme0n1 on session client, (access_mode: rw, nr_poll_queues: 0)
 ------------[ cut here ]------------
 WARNING: CPU: 0 PID: 26407 at drivers/infiniband/sw/rxe/rxe_pool.c:256 __rxe_cleanup+0x13a/0x170 [rdma_rxe]
 Modules linked in: rpcrdma rdma_ucm ib_iser rnbd_client libiscsi rtrs_client scsi_transport_iscsi rtrs_core rdma_cm iw_cm ib_cm crc32_generic rdma_rxe udp_tunnel ib_uverbs ib_core kmem device_dax nd_pmem dax_pmem nd_vme crc32c_intel fuse nvme_core nfit libnvdimm dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua dm_mirror dm_region_hash dm_log dm_mod
 CPU: 0 PID: 26407 Comm: rnbd-client.sh Kdump: loaded Not tainted 6.2.0-rc6-roce-flush+ #53
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
 RIP: 0010:__rxe_cleanup+0x13a/0x170 [rdma_rxe]
 Code: 45 84 e4 0f 84 5a ff ff ff 48 89 ef e8 5f 18 71 f9 84 c0 75 90 be c8 00 00 00 48 89 ef e8 be 89 1f fa 85 c0 0f 85 7b ff ff ff <0f> 0b 41 bc ea ff ff ff e9 71 ff ff ff e8 84 7f 1f fa e9 d0 fe ff
 RSP: 0018:ffffb09880b6f5f0 EFLAGS: 00010246
 RAX: 0000000000000000 RBX: ffff99401f15d6a8 RCX: 0000000000000000
 RDX: 0000000000000001 RSI: ffffffffbac8234b RDI: 00000000ffffffff
 RBP: ffff99401f15d6d0 R08: 0000000000000001 R09: 0000000000000001
 R10: 0000000000002d82 R11: 0000000000000000 R12: 0000000000000001
 R13: ffff994101eff208 R14: ffffb09880b6f6a0 R15: 00000000fffffe00
 FS:  00007fe113904740(0000) GS:ffff99413bc00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007ff6cde656c8 CR3: 000000001f108004 CR4: 00000000001706f0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  <TASK>
  rxe_dealloc_pd+0x16/0x20 [rdma_rxe]
  ib_dealloc_pd_user+0x4b/0x80 [ib_core]
  rtrs_ib_dev_put+0x79/0xd0 [rtrs_core]
  destroy_con_cq_qp+0x8a/0xa0 [rtrs_client]
  init_path+0x1e7/0x9a0 [rtrs_client]
  ? __pfx_autoremove_wake_function+0x10/0x10
  ? lock_is_held_type+0xd7/0x130
  ? rcu_read_lock_sched_held+0x43/0x80
  ? pcpu_alloc+0x3dd/0x7d0
  ? rtrs_clt_init_stats+0x18/0x40 [rtrs_client]
  rtrs_clt_open+0x24f/0x5a0 [rtrs_client]
  ? __pfx_rnbd_clt_link_ev+0x10/0x10 [rnbd_client]
  rnbd_clt_map_device+0x6a5/0xe10 [rnbd_client]

Fixes: 6a98d71daea1 ("RDMA/rtrs: client: main functionality")
Link: https://lore.kernel.org/r/1682384563-2-4-git-send-email-lizhijian@fujitsu.com
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Tested-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-01 13:13:02 -03:00
Li Zhijian
3bf3a7c698 RDMA/rtrs: Fix the last iu->buf leak in err path
The last iu->buf will leak if ib_dma_mapping_error() fails.

Fixes: c0894b3ea69d ("RDMA/rtrs: core: lib functions shared between client and server modules")
Link: https://lore.kernel.org/r/1682384563-2-3-git-send-email-lizhijian@fujitsu.com
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-01 13:13:02 -03:00
Li Zhijian
1cc625cecc RDMA/rtrs: Remove duplicate cq_num assignment
line 1701 and 1713 are duplicate:
> 1701         cq_num = max_send_wr + max_recv_wr;
  1702         /* alloc iu to recv new rkey reply when server reports flags set */
  1703         if (clt_path->flags & RTRS_MSG_NEW_RKEY_F || con->c.cid == 0) {
  1704                 con->rsp_ius = rtrs_iu_alloc(cq_num, sizeof(*rsp),
  1705                                               GFP_KERNEL,
  1706                                               clt_path->s.dev->ib_dev,
  1707                                               DMA_FROM_DEVICE,
  1708                                               rtrs_clt_rdma_done);
  1709                 if (!con->rsp_ius)
  1710                         return -ENOMEM;
  1711                 con->queue_num = cq_num;
  1712         }
> 1713         cq_num = max_send_wr + max_recv_wr;

Remove the duplicate.

Link: https://lore.kernel.org/r/1682384563-2-2-git-send-email-lizhijian@fujitsu.com
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-01 13:11:41 -03:00
Nicolas Morey
84510a61ef RDMA/rxe: Remove dangling declaration of rxe_cq_disable()
rxe_cq_disable() has been removed but not its declaration.

Fixes: 78b26a335310 ("RDMA/rxe: Remove tasklet call from rxe_cq.c")
Link: https://lore.kernel.org/r/4f20ffc5-b2c4-0c11-2883-a835caf01a94@suse.com
Signed-off-by: Nicolas Morey <nmorey@suse.com>
Acked-by: Zhu Yanjun <zyjzyj2000@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-01 12:59:32 -03:00
Arnd Bergmann
b002760f87 RDMA/irdma: avoid fortify-string warning in irdma_clr_wqes
Commit df8fc4e934c1 ("kbuild: Enable -fstrict-flex-arrays=3") triggers a
warning for fortified memset():

In function 'fortify_memset_chk',
    inlined from 'irdma_clr_wqes' at drivers/infiniband/hw/irdma/uk.c:103:4:
include/linux/fortify-string.h:493:25: error: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Werror=attribute-warning]
  493 |                         __write_overflow_field(p_size_field, size);
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The problem here isthat the inner array only has four 8-byte elements, so
clearing 4096 bytes overflows that. As this structure is part of an outer
array, change the code to pass a pointer to the irdma_qp_quanta instead,
and change the size argument for readability, matching the comment above
it.

Fixes: 551c46edc769 ("RDMA/irdma: Add user/kernel shared libraries")
Link: https://lore.kernel.org/r/20230523111859.2197825-1-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-01 12:59:12 -03:00
Long Li
2145328515 RDMA/mana_ib: Use v2 version of cfg_rx_steer_req to enable RX coalescing
With RX coalescing, one CQE entry can be used to indicate multiple packets
on the receive queue. This saves processing time and PCI bandwidth over
the CQ.

The MANA Ethernet driver also uses the v2 version of the protocol. It
doesn't use RX coalescing and its behavior is not changed.

Link: https://lore.kernel.org/r/1684045095-31228-1-git-send-email-longli@linuxonhyperv.com
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-06-01 12:52:01 -03:00
Mustafa Ismail
5842d1d9c1 RDMA/irdma: Fix Local Invalidate fencing
If the local invalidate fence is indicated in the WR, only the read fence
is currently being set in WQE. Fix this to set both the read and local
fence in the WQE.

Fixes: b48c24c2d710 ("RDMA/irdma: Implement device supported verb APIs")
Link: https://lore.kernel.org/r/20230522155654.1309-4-shiraz.saleem@intel.com
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-05-29 14:06:29 -03:00
Mustafa Ismail
c8f304d75f RDMA/irdma: Prevent QP use after free
There is a window where the poll cq may use a QP that has been freed.
This can happen if a CQE is polled before irdma_clean_cqes() can clear the
CQE's related to the QP and the destroy QP races to free the QP memory.
then the QP structures are used in irdma_poll_cq.  Fix this by moving the
clearing of CQE's before the reference is removed and the QP is destroyed.

Fixes: b48c24c2d710 ("RDMA/irdma: Implement device supported verb APIs")
Link: https://lore.kernel.org/r/20230522155654.1309-3-shiraz.saleem@intel.com
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-05-29 14:06:29 -03:00
David Howells
c2ff29e99a siw: Inline do_tcp_sendpages()
do_tcp_sendpages() is now just a small wrapper around tcp_sendmsg_locked(),
so inline it, allowing do_tcp_sendpages() to be removed.  This is part of
replacing ->sendpage() with a call to sendmsg() with MSG_SPLICE_PAGES set.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Bernard Metzler <bmt@zurich.ibm.com>
Reviewed-by: Tom Talpey <tom@talpey.com>
cc: Jason Gunthorpe <jgg@ziepe.ca>
cc: Leon Romanovsky <leon@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-23 20:48:27 -07:00
Kalesh AP
8c1ee346da RDMA/bnxt_re: Remove unnecessary checks
The NULL check inside bnxt_qplib_del_sgid() and bnxt_qplib_add_sgid()
always return false as the "sgid_tbl" inside "rdev->qplib_res" is a static
memory.

Fixes: 1ac5a4047975 ("RDMA/bnxt_re: Add bnxt_re RoCE driver")
Link: https://lore.kernel.org/r/1684478897-12247-8-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-05-19 13:26:19 -03:00
Kalesh AP
07d5ce14b2 RDMA/bnxt_re: Return directly without goto jumps
When there is no cleanup to be done, return directly.  This will help
eliminating unnecessary local variables and goto labels.  This patch fixes
such occurrences in qplib_fp.c file.

Fixes: 37cb11acf1f7 ("RDMA/bnxt_re: Add SRQ support for Broadcom adapters")
Fixes: 159fb4ceacd7 ("RDMA/bnxt_re: introduce a function to allocate swq")
Link: https://lore.kernel.org/r/1684478897-12247-7-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-05-19 13:26:19 -03:00
Kalesh AP
43774bc156 RDMA/bnxt_re: Fix to remove an unnecessary log
During destroy_qp, driver sets the qp handle in the existing CQEs
belonging to the QP being destroyed to NULL. As a result, a poll_cq after
destroy_qp can report unnecessary messages.  Remove this noise from system
logs.

Fixes: 1ac5a4047975 ("RDMA/bnxt_re: Add bnxt_re RoCE driver")
Link: https://lore.kernel.org/r/1684478897-12247-6-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-05-19 13:26:18 -03:00
Kalesh AP
b989f90cef RDMA/bnxt_re: Remove a redundant check inside bnxt_re_update_gid
The NULL check inside bnxt_re_update_gid() always return false.  If
sgid_tbl->tbl is not allocated, then dev_init would have failed.

Fixes: 5fac5b1b297f ("RDMA/bnxt_re: Add vlan tag for untagged RoCE traffic when PFC is configured")
Link: https://lore.kernel.org/r/1684478897-12247-5-git-send-email-selvin.xavier@broadcom.com
Reviewed-by: Saravanan Vajravel <saravanan.vajravel@broadcom.com>
Reviewed-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-05-19 13:26:18 -03:00
Kalesh AP
ff2e4bfd16 RDMA/bnxt_re: Use unique names while registering interrupts
bnxt_re currently uses the names "bnxt_qplib_creq" and "bnxt_qplib_nq-0"
while registering IRQs. There is no way to distinguish the IRQs of
different device ports when there are multiple IB devices registered.
This could make the scenarios worse where one want to pin IRQs of a device
port to certain CPUs.

Fixed the code to use unique names which has PCI BDF information while
registering interrupts like: "bnxt_re-nq-0@pci:0000:65:00.0" and
"bnxt_re-creq@pci:0000:65:00.1".

Fixes: 1ac5a4047975 ("RDMA/bnxt_re: Add bnxt_re RoCE driver")
Link: https://lore.kernel.org/r/1684478897-12247-4-git-send-email-selvin.xavier@broadcom.com
Reviewed-by: Bhargava Chenna Marreddy <bhargava.marreddy@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-05-19 13:26:18 -03:00
Kalesh AP
9b3ee47796 RDMA/bnxt_re: Fix to remove unnecessary return labels
If there is no cleanup needed then just return directly.  This cleans up
the code and improve readability.

Fixes: 1ac5a4047975 ("RDMA/bnxt_re: Add bnxt_re RoCE driver")
Link: https://lore.kernel.org/r/1684478897-12247-3-git-send-email-selvin.xavier@broadcom.com
Reviewed-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reviewed-by: Saravanan Vajravel <saravanan.vajravel@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-05-19 13:26:18 -03:00
Selvin Xavier
ab112ee789 RDMA/bnxt_re: Disable/kill tasklet only if it is enabled
When the ulp hook to start the IRQ fails because the rings are not
available, tasklets are not enabled. In this case when the driver is
unloaded, driver calls CREQ tasklet_kill. This causes an indefinite hang
as the tasklet is not enabled.

Driver shouldn't call tasklet_kill if it is not enabled. So using the
creq->requested and nq->requested flags to identify if both tasklets/irqs
are registered. Checking this flag while scheduling the tasklet from
ISR. Also, added a cleanup for disabling tasklet, in case request_irq
fails during start_irq.

Check for return value for bnxt_qplib_rcfw_start_irq and in case the
bnxt_qplib_rcfw_start_irq fails, return bnxt_re_start_irq without
attempting to start NQ IRQs.

Fixes: 1ac5a4047975 ("RDMA/bnxt_re: Add bnxt_re RoCE driver")
Link: https://lore.kernel.org/r/1684478897-12247-2-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-05-19 13:26:18 -03:00
Kalesh AP
dd5fb04857 RDMA/bnxt_re: Do not enable congestion control on VFs
Congestion control needs to be enabled only on the PFs. FW fails the
command if issued on VFs. Avoid sending the command on VFs.

Fixes: f13bcef04ba0 ("RDMA/bnxt_re: Enable congestion control by default")
Link: https://lore.kernel.org/r/1684397461-23082-4-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Selvin Thyparampil Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-05-19 13:14:27 -03:00
Kalesh AP
0fa0d520e2 RDMA/bnxt_re: Fix return value of bnxt_re_process_raw_qp_pkt_rx
bnxt_re_process_raw_qp_pkt_rx() always return 0 and ignores the return
value of bnxt_re_post_send_shadow_qp().

Fixes: 1ac5a4047975 ("RDMA/bnxt_re: Add bnxt_re RoCE driver")
Link: https://lore.kernel.org/r/1684397461-23082-3-git-send-email-selvin.xavier@broadcom.com
Reviewed-by: Hongguang Gao <hongguang.gao@broadcom.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-05-19 13:14:19 -03:00
Kalesh AP
349e3c0cf2 RDMA/bnxt_re: Fix a possible memory leak
Inside bnxt_qplib_create_cq(), when the check for NULL DPI fails, driver
returns directly without freeing the memory allocated inside
bnxt_qplib_alloc_init_hwq() routine.

Fixed this by moving the check for NULL DPI before invoking
bnxt_qplib_alloc_init_hwq().

Fixes: 1ac5a4047975 ("RDMA/bnxt_re: Add bnxt_re RoCE driver")
Link: https://lore.kernel.org/r/1684397461-23082-2-git-send-email-selvin.xavier@broadcom.com
Reviewed-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-05-19 13:09:25 -03:00
Daisuke Matsuda
42b0a5e691 RDMA/rxe: Fix comments about removed tasklets
The commit 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue support for rxe tasks")
removed tasklets and replaced them with a workqueue, but relevant comments
are still remaining in the source code.

Fixes: 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue support for rxe tasks")
Link: https://lore.kernel.org/r/20230518070027.942715-1-matsuda-daisuke@fujitsu.com
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Reviewed-by: Bob Pearson <rpearsonhpe@gmail.com>
Acked-by: Zhu Yanjun <zyjzyj2000@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-05-19 12:02:26 -03:00
Kamal Heib
a7dae5daf4 RDMA/irdma: Move iw device ops initialization
Move the initialization of the iw device ops to be under the declaration
of the irdma_iw_dev_ops.

Link: https://lore.kernel.org/r/20230515191142.413633-4-kheib@redhat.com
Signed-off-by: Kamal Heib <kheib@redhat.com>
Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-05-17 16:31:37 -03:00
Kamal Heib
bc89be9443 RDMA/irdma: Return void from irdma_init_rdma_device()
The return value from irdma_init_rdma_device() is always 0 - change it to
be void.

Link: https://lore.kernel.org/r/20230515191142.413633-3-kheib@redhat.com
Signed-off-by: Kamal Heib <kheib@redhat.com>
Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-05-17 16:31:37 -03:00
Kamal Heib
ab4e8fc174 RDMA/irdma: Return void from irdma_init_iw_device()
The return value from irdma_init_iw_device() is always 0 - change it to be
void.

Link: https://lore.kernel.org/r/20230515191142.413633-2-kheib@redhat.com
Signed-off-by: Kamal Heib <kheib@redhat.com>
Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-05-17 16:31:37 -03:00
Yangyang Li
56518a603f RDMA/hns: Modify the value of long message loopback slice
Long message loopback slice is used for achieving traffic balance between
QPs. It prevents the problem that QPs with large traffic occupying the
hardware pipeline for a long time and QPs with small traffic cannot be
scheduled.

Currently, its maximum value is set to 16K, which means only after a QP
sends 16K will the second QP be scheduled. This value is too large, which
will lead to unbalanced traffic scheduling, and thus it needs to be
modified.

The setting range of the long message loopback slice is modified to be
from 1024 (the lower limit supported by hardware) to mtu. Actual testing
shows that this value can significantly reduce error in hardware traffic
scheduling.

This solution is compatible with both HIP08 and HIP09. The modified
lp_pktn_ini has a maximum value of 2 (when mtu is 256), so the range
checking code for lp_pktn_ini is no longer necessary and needs to be
deleted.

Fixes: 0e60778efb07 ("RDMA/hns: Modify the value of MAX_LP_MSG_LEN to meet hardware compatibility")
Link: https://lore.kernel.org/r/20230512092245.344442-4-huangjunxian6@hisilicon.com
Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-05-17 16:05:00 -03:00
Chengchang Tang
7f3969b14f RDMA/hns: Fix base address table allocation
For hns, the specification of an entry like resource (E.g. WQE/CQE/EQE)
depends on BT page size, buf page size and hopnum. For user mode, the buf
page size depends on UMEM. Therefore, the actual specification is
controlled by BT page size and hopnum.

The current BT page size and hopnum are obtained from firmware. This makes
the driver inflexible and introduces unnecessary constraints.  Resource
allocation failures occur in many scenarios.

This patch will calculate whether the BT page size set by firmware is
sufficient before allocating BT, and increase the BT page size if it is
insufficient.

Fixes: 1133401412a9 ("RDMA/hns: Optimize base address table config flow for qp buffer")
Link: https://lore.kernel.org/r/20230512092245.344442-3-huangjunxian6@hisilicon.com
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-05-17 16:05:00 -03:00
Chengchang Tang
58caa2a51a RDMA/hns: Fix timeout attr in query qp for HIP08
On HIP08, the queried timeout attr is different from the timeout attr
configured by the user.

It is found by rdma-core testcase test_rdmacm_async_traffic:

======================================================================
FAIL: test_rdmacm_async_traffic (tests.test_rdmacm.CMTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "./tests/test_rdmacm.py", line 33, in test_rdmacm_async_traffic
    self.two_nodes_rdmacm_traffic(CMAsyncConnection, self.rdmacm_traffic,
  File "./tests/base.py", line 382, in two_nodes_rdmacm_traffic
    raise(res)
AssertionError

Fixes: 926a01dc000d ("RDMA/hns: Add QP operations support for hip08 SoC")
Link: https://lore.kernel.org/r/20230512092245.344442-2-huangjunxian6@hisilicon.com
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-05-17 16:05:00 -03:00
Bob Pearson
9b4b7c1f9f RDMA/rxe: Add workqueue support for rxe tasks
Replace tasklets by work queues for the three main rxe tasklets:
rxe_requester, rxe_completer and rxe_responder.

work queues are a more modern way to process work from an IRQ and provide
more control over how that work is run for future patches.

Link: https://lore.kernel.org/r/20230428171321.5774-1-rpearsonhpe@gmail.com
Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Reviewed-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Tested-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-05-17 15:34:25 -03:00
Yonatan Nachum
866422cddd RDMA/efa: Fix unsupported page sizes in device
Device uses 4KB size blocks for user pages indirect list while the
driver creates those blocks with the size of PAGE_SIZE of the kernel. On
kernels with PAGE_SIZE different than 4KB (ARM RHEL), this leads to a
failure on register MR with indirect list because of the miss
communication between driver and device.

Fixes: 40909f664d27 ("RDMA/efa: Add EFA verbs implementation")
Link: https://lore.kernel.org/r/20230511115103.13876-1-ynachum@amazon.com
Reviewed-by: Firas Jahjah <firasj@amazon.com>
Reviewed-by: Michael Margolin <mrgolin@amazon.com>
Signed-off-by: Yonatan Nachum <ynachum@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-05-16 21:13:22 -03:00
Guoqing Jiang
b5f3fe27c5 RDMA/rxe: Convert spin_{lock_bh,unlock_bh} to spin_{lock_irqsave,unlock_irqrestore}
We need to call spin_lock_irqsave()/spin_unlock_irqrestore() for
state_lock in rxe, otherwsie the callchain:

  ib_post_send_mad
	-> spin_lock_irqsave
	-> ib_post_send -> rxe_post_send
				-> spin_lock_bh
				-> spin_unlock_bh
	-> spin_unlock_irqrestore

Causes below traces during run block nvmeof-mp/001 test due to mismatched
spinlock nesting:

  WARNING: CPU: 0 PID: 94794 at kernel/softirq.c:376 __local_bh_enable_ip+0xc2/0x140
  [ ... ]
  CPU: 0 PID: 94794 Comm: kworker/u4:1 Tainted: G            E      6.4.0-rc1 #9
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.15.0-0-g2dd4b9b-rebuilt.opensuse.org 04/01/2014
  Workqueue: rdma_cm cma_work_handler [rdma_cm]
  RIP: 0010:__local_bh_enable_ip+0xc2/0x140
  Code: 48 85 c0 74 72 5b 41 5c 5d 31 c0 89 c2 89 c1 89 c6 89 c7 41 89 c0 e9 bd 0e 11 01 65 8b 05 f2 65 72 48 85 c0 0f 85 76 ff ff ff <0f> 0b e9 6f ff ff ff e8 d2 39 1c 00 eb 80 4c 89 e7 e8 68 ad 0a 00
  RSP: 0018:ffffb7cf818539f0 EFLAGS: 00010046
  RAX: 0000000000000000 RBX: 0000000000000201 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: 0000000000000201 RDI: ffffffffc0f25f79
  RBP: ffffb7cf81853a00 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffc0f25f79
  R13: ffff8db1f0fa6000 R14: ffff8db2c63ff000 R15: 00000000000000e8
  FS:  0000000000000000(0000) GS:ffff8db33bc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000559758db0f20 CR3: 0000000105124000 CR4: 00000000003506f0
  Call Trace:
   <TASK>
   _raw_spin_unlock_bh+0x31/0x40
   rxe_post_send+0x59/0x8b0 [rdma_rxe]
   ib_send_mad+0x26b/0x470 [ib_core]
   ib_post_send_mad+0x150/0xb40 [ib_core]
   ? cm_form_tid+0x5b/0x90 [ib_cm]
   ib_send_cm_req+0x7c8/0xb70 [ib_cm]
   rdma_connect_locked+0x433/0x940 [rdma_cm]
   nvme_rdma_cm_handler+0x5d7/0x9c0 [nvme_rdma]
   cma_cm_event_handler+0x4f/0x170 [rdma_cm]
   cma_work_handler+0x6a/0xe0 [rdma_cm]
   process_one_work+0x2a9/0x580
   worker_thread+0x52/0x3f0
   ? __pfx_worker_thread+0x10/0x10
   kthread+0x109/0x140
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x2c/0x50
   </TASK>


  raw_local_irq_restore() called with IRQs enabled
  WARNING: CPU: 0 PID: 94794 at kernel/locking/irqflag-debug.c:10 warn_bogus_irq_restore+0x37/0x60
  [ ... ]
  CPU: 0 PID: 94794 Comm: kworker/u4:1 Tainted: G        W   E      6.4.0-rc1 #9
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.15.0-0-g2dd4b9b-rebuilt.opensuse.org 04/01/2014
  Workqueue: rdma_cm cma_work_handler [rdma_cm]
  RIP: 0010:warn_bogus_irq_restore+0x37/0x60
  Code: fb 01 77 36 83 e3 01 74 0e 48 8b 5d f8 c9 31 f6 89 f7 e9 ac ea 01 00 48 c7 c7 e0 52 33 b9 c6 05 bb 1c 69 01 01 e8 39 24 f0 fe <0f> 0b 48 8b 5d f8 c9 31 f6 89 f7 e9 89 ea 01 00 0f b6 f3 48 c7 c7
  RSP: 0018:ffffb7cf81853a58 EFLAGS: 00010246
  RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
  RBP: ffffb7cf81853a60 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000000 R12: ffff8db2cfb1a9e8
  R13: ffff8db2cfb1a9d8 R14: ffff8db2c63ff000 R15: 0000000000000000
  FS:  0000000000000000(0000) GS:ffff8db33bc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000559758db0f20 CR3: 0000000105124000 CR4: 00000000003506f0
  Call Trace:
   <TASK>
   _raw_spin_unlock_irqrestore+0x91/0xa0
   ib_send_mad+0x1e3/0x470 [ib_core]
   ib_post_send_mad+0x150/0xb40 [ib_core]
   ? cm_form_tid+0x5b/0x90 [ib_cm]
   ib_send_cm_req+0x7c8/0xb70 [ib_cm]
   rdma_connect_locked+0x433/0x940 [rdma_cm]
   nvme_rdma_cm_handler+0x5d7/0x9c0 [nvme_rdma]
   cma_cm_event_handler+0x4f/0x170 [rdma_cm]
   cma_work_handler+0x6a/0xe0 [rdma_cm]
   process_one_work+0x2a9/0x580
   worker_thread+0x52/0x3f0
   ? __pfx_worker_thread+0x10/0x10
   kthread+0x109/0x140
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x2c/0x50
   </TASK>

Fixes: f605f26ea196 ("RDMA/rxe: Protect QP state with qp->state_lock")
Link: https://lore.kernel.org/r/20230510035056.881196-1-guoqing.jiang@linux.dev
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Reviewed-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-05-16 21:07:33 -03:00
Bob Pearson
17eabd6a04 RDMA/rxe: Fix double unlock in rxe_qp.c
A recent patch can cause a double spin_unlock_bh() in rxe_qp_to_attr() at
line 715 in rxe_qp.c. Move the 2nd unlock into the if statement.

Fixes: f605f26ea196 ("RDMA/rxe: Protect QP state with qp->state_lock")
Link: https://lore.kernel.org/r/20230515201056.1591140-1-rpearsonhpe@gmail.com
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/27773078-40ce-414f-8b97-781954da9f25@kili.mountain
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-05-16 16:52:45 -03:00
Selvin Xavier
08c7f09356 RDMA/bnxt_re: Fix the page_size used during the MR creation
Driver populates the list of pages used for Memory region wrongly when
page size is more than system page size. This is causing a failure when
some of the applications that creates MR with page size as 2M.  Since HW
can support multiple page sizes, pass the correct page size while creating
the MR.

Also, driver need not adjust the number of pages when HW Queues are
created with user memory. It should work with the number of dma blocks
returned by ib_umem_num_dma_blocks. Fix this calculation also.

Fixes: 0c4dcd602817 ("RDMA/bnxt_re: Refactor hardware queue memory allocation")
Fixes: f6919d56388c ("RDMA/bnxt_re: Code refactor while populating user MRs")
Link: https://lore.kernel.org/r/1683484169-9539-1-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-05-12 18:30:30 -03:00
Linus Torvalds
af3877265d v6.4 merge window RDMA pull request
Usual wide collection of unrelated items in drivers:
 
 - Driver bug fixes and treewide cleanups in hfi1, siw, qib, mlx5, rxe,
   usnic, usnic, bnxt_re, ocrdma, iser
    * Unnecessary NULL checks
    * kmap obsolescence
    * pci_enable_pcie_error_reporting() obsolescence
    * Unused variables and macros
    * trace event related warnings
    * casting warnings
 
 - Code cleanups for irdm and erdma
 
 - EFA reporting of 128 byte PCIe TLP support
 
 - mlx5 more agressively uses the out of order HW feature
 
 - Big rework of how state machines and tasks work in rxe
 
 - Fix a syzkaller found crash netdev refcount leak in siw
 
 - bnxt_re revises their HW description header
 
 - Congestion control for bnxt_re
 
 - Use mmu_notifiers more safely in hfi1
 
 - mlx5 gets better support for PCIe relaxed ordering inside VMs
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZEva5wAKCRCFwuHvBreF
 YZFmAQC9T3b/XQ3bRknYciuzbatC98o9xB0FTqmEFYGj+Y2lVAD9EEVe3HKfHfi3
 t/GxXYB5r22oxg5bgsblZfEdEdTVCg8=
 =akMm
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "Usual wide collection of unrelated items in drivers:

   - Driver bug fixes and treewide cleanups in hfi1, siw, qib, mlx5,
     rxe, usnic, usnic, bnxt_re, ocrdma, iser:
       - remove unnecessary NULL checks
       - kmap obsolescence
       - pci_enable_pcie_error_reporting() obsolescence
       - unused variables and macros
       - trace event related warnings
       - casting warnings

   - Code cleanups for irdm and erdma

   - EFA reporting of 128 byte PCIe TLP support

   - mlx5 more agressively uses the out of order HW feature

   - Big rework of how state machines and tasks work in rxe

   - Fix a syzkaller found crash netdev refcount leak in siw

   - bnxt_re revises their HW description header

   - Congestion control for bnxt_re

   - Use mmu_notifiers more safely in hfi1

   - mlx5 gets better support for PCIe relaxed ordering inside VMs"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (81 commits)
  RDMA/efa: Add rdma write capability to device caps
  RDMA/mlx5: Use correct device num_ports when modify DC
  RDMA/irdma: Drop spurious WQ_UNBOUND from alloc_ordered_workqueue() call
  RDMA/rxe: Fix spinlock recursion deadlock on requester
  RDMA/mlx5: Fix flow counter query via DEVX
  RDMA/rxe: Protect QP state with qp->state_lock
  RDMA/rxe: Move code to check if drained to subroutine
  RDMA/rxe: Remove qp->req.state
  RDMA/rxe: Remove qp->comp.state
  RDMA/rxe: Remove qp->resp.state
  RDMA/mlx5: Allow relaxed ordering read in VFs and VMs
  net/mlx5: Update relaxed ordering read HCA capabilities
  RDMA/mlx5: Check pcie_relaxed_ordering_enabled() in UMR
  RDMA/mlx5: Remove pcie_relaxed_ordering_enabled() check for RO write
  RDMA: Add ib_virt_dma_to_page()
  RDMA/rxe: Fix the error "trying to register non-static key in rxe_cleanup_task"
  RDMA/irdma: Slightly optimize irdma_form_ah_cm_frame()
  RDMA/rxe: Fix incorrect TASKLET_STATE_SCHED check in rxe_task.c
  IB/hfi1: Place struct mmu_rb_handler on cache line start
  IB/hfi1: Fix bugs with non-PAGE_SIZE-end multi-iovec user SDMA requests
  ...
2023-04-29 17:21:24 -07:00
Linus Torvalds
b6a7828502 modules-6.4-rc1
The summary of the changes for this pull requests is:
 
  * Song Liu's new struct module_memory replacement
  * Nick Alcock's MODULE_LICENSE() removal for non-modules
  * My cleanups and enhancements to reduce the areas where we vmalloc
    module memory for duplicates, and the respective debug code which
    proves the remaining vmalloc pressure comes from userspace.
 
 Most of the changes have been in linux-next for quite some time except
 the minor fixes I made to check if a module was already loaded
 prior to allocating the final module memory with vmalloc and the
 respective debug code it introduces to help clarify the issue. Although
 the functional change is small it is rather safe as it can only *help*
 reduce vmalloc space for duplicates and is confirmed to fix a bootup
 issue with over 400 CPUs with KASAN enabled. I don't expect stable
 kernels to pick up that fix as the cleanups would have also had to have
 been picked up. Folks on larger CPU systems with modules will want to
 just upgrade if vmalloc space has been an issue on bootup.
 
 Given the size of this request, here's some more elaborate details
 on this pull request.
 
 The functional change change in this pull request is the very first
 patch from Song Liu which replaces the struct module_layout with a new
 struct module memory. The old data structure tried to put together all
 types of supported module memory types in one data structure, the new
 one abstracts the differences in memory types in a module to allow each
 one to provide their own set of details. This paves the way in the
 future so we can deal with them in a cleaner way. If you look at changes
 they also provide a nice cleanup of how we handle these different memory
 areas in a module. This change has been in linux-next since before the
 merge window opened for v6.3 so to provide more than a full kernel cycle
 of testing. It's a good thing as quite a bit of fixes have been found
 for it.
 
 Jason Baron then made dynamic debug a first class citizen module user by
 using module notifier callbacks to allocate / remove module specific
 dynamic debug information.
 
 Nick Alcock has done quite a bit of work cross-tree to remove module
 license tags from things which cannot possibly be module at my request
 so to:
 
   a) help him with his longer term tooling goals which require a
      deterministic evaluation if a piece a symbol code could ever be
      part of a module or not. But quite recently it is has been made
      clear that tooling is not the only one that would benefit.
      Disambiguating symbols also helps efforts such as live patching,
      kprobes and BPF, but for other reasons and R&D on this area
      is active with no clear solution in sight.
 
   b) help us inch closer to the now generally accepted long term goal
      of automating all the MODULE_LICENSE() tags from SPDX license tags
 
 In so far as a) is concerned, although module license tags are a no-op
 for non-modules, tools which would want create a mapping of possible
 modules can only rely on the module license tag after the commit
 8b41fc4454e ("kbuild: create modules.builtin without Makefile.modbuiltin
 or tristate.conf").  Nick has been working on this *for years* and
 AFAICT I was the only one to suggest two alternatives to this approach
 for tooling. The complexity in one of my suggested approaches lies in
 that we'd need a possible-obj-m and a could-be-module which would check
 if the object being built is part of any kconfig build which could ever
 lead to it being part of a module, and if so define a new define
 -DPOSSIBLE_MODULE [0]. A more obvious yet theoretical approach I've
 suggested would be to have a tristate in kconfig imply the same new
 -DPOSSIBLE_MODULE as well but that means getting kconfig symbol names
 mapping to modules always, and I don't think that's the case today. I am
 not aware of Nick or anyone exploring either of these options. Quite
 recently Josh Poimboeuf has pointed out that live patching, kprobes and
 BPF would benefit from resolving some part of the disambiguation as
 well but for other reasons. The function granularity KASLR (fgkaslr)
 patches were mentioned but Joe Lawrence has clarified this effort has
 been dropped with no clear solution in sight [1].
 
 In the meantime removing module license tags from code which could never
 be modules is welcomed for both objectives mentioned above. Some
 developers have also welcomed these changes as it has helped clarify
 when a module was never possible and they forgot to clean this up,
 and so you'll see quite a bit of Nick's patches in other pull
 requests for this merge window. I just picked up the stragglers after
 rc3. LWN has good coverage on the motivation behind this work [2] and
 the typical cross-tree issues he ran into along the way. The only
 concrete blocker issue he ran into was that we should not remove the
 MODULE_LICENSE() tags from files which have no SPDX tags yet, even if
 they can never be modules. Nick ended up giving up on his efforts due
 to having to do this vetting and backlash he ran into from folks who
 really did *not understand* the core of the issue nor were providing
 any alternative / guidance. I've gone through his changes and dropped
 the patches which dropped the module license tags where an SPDX
 license tag was missing, it only consisted of 11 drivers.  To see
 if a pull request deals with a file which lacks SPDX tags you
 can just use:
 
   ./scripts/spdxcheck.py -f \
 	$(git diff --name-only commid-id | xargs echo)
 
 You'll see a core module file in this pull request for the above,
 but that's not related to his changes. WE just need to add the SPDX
 license tag for the kernel/module/kmod.c file in the future but
 it demonstrates the effectiveness of the script.
 
 Most of Nick's changes were spread out through different trees,
 and I just picked up the slack after rc3 for the last kernel was out.
 Those changes have been in linux-next for over two weeks.
 
 The cleanups, debug code I added and final fix I added for modules
 were motivated by David Hildenbrand's report of boot failing on
 a systems with over 400 CPUs when KASAN was enabled due to running
 out of virtual memory space. Although the functional change only
 consists of 3 lines in the patch "module: avoid allocation if module is
 already present and ready", proving that this was the best we can
 do on the modules side took quite a bit of effort and new debug code.
 
 The initial cleanups I did on the modules side of things has been
 in linux-next since around rc3 of the last kernel, the actual final
 fix for and debug code however have only been in linux-next for about a
 week or so but I think it is worth getting that code in for this merge
 window as it does help fix / prove / evaluate the issues reported
 with larger number of CPUs. Userspace is not yet fixed as it is taking
 a bit of time for folks to understand the crux of the issue and find a
 proper resolution. Worst come to worst, I have a kludge-of-concept [3]
 of how to make kernel_read*() calls for modules unique / converge them,
 but I'm currently inclined to just see if userspace can fix this
 instead.
 
 [0] https://lore.kernel.org/all/Y/kXDqW+7d71C4wz@bombadil.infradead.org/
 [1] https://lkml.kernel.org/r/025f2151-ce7c-5630-9b90-98742c97ac65@redhat.com
 [2] https://lwn.net/Articles/927569/
 [3] https://lkml.kernel.org/r/20230414052840.1994456-3-mcgrof@kernel.org
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmRG4m0SHG1jZ3JvZkBr
 ZXJuZWwub3JnAAoJEM4jHQowkoinQ2oP/0xlvKwJg6Ey8fHZF0qv8VOskE80zoLF
 hMazU3xfqLA+1TQvouW1YBxt3jwS3t1Ehs+NrV+nY9Yzcm0MzRX/n3fASJVe7nRr
 oqWWQU+voYl5Pw1xsfdp6C8IXpBQorpYby3Vp0MAMoZyl2W2YrNo36NV488wM9KC
 jD4HF5Z6xpnPSZTRR7AgW9mo7FdAtxPeKJ76Bch7lH8U6omT7n36WqTw+5B1eAYU
 YTOvrjRs294oqmWE+LeebyiOOXhH/yEYx4JNQgCwPdxwnRiGJWKsk5va0hRApqF/
 WW8dIqdEnjsa84lCuxnmWgbcPK8cgmlO0rT0DyneACCldNlldCW1LJ0HOwLk9pea
 p3JFAsBL7TKue4Tos6I7/4rx1ufyBGGIigqw9/VX5g0Iif+3BhWnqKRfz+p9wiMa
 Fl7cU6u7yC68CHu1HBSisK16cYMCPeOnTSd89upHj8JU/t74O6k/ARvjrQ9qmNUt
 c5U+OY+WpNJ1nXQydhY/yIDhFdYg8SSpNuIO90r4L8/8jRQYXNG80FDd1UtvVDuy
 eq0r2yZ8C0XHSlOT9QHaua/tWV/aaKtyC/c0hDRrigfUrq8UOlGujMXbUnrmrWJI
 tLJLAc7ePWAAoZXGSHrt0U27l029GzLwRdKqJ6kkDANVnTeOdV+mmBg9zGh3/Mp6
 agiwdHUMVN7X
 =56WK
 -----END PGP SIGNATURE-----

Merge tag 'modules-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux

Pull module updates from Luis Chamberlain:
 "The summary of the changes for this pull requests is:

   - Song Liu's new struct module_memory replacement

   - Nick Alcock's MODULE_LICENSE() removal for non-modules

   - My cleanups and enhancements to reduce the areas where we vmalloc
     module memory for duplicates, and the respective debug code which
     proves the remaining vmalloc pressure comes from userspace.

  Most of the changes have been in linux-next for quite some time except
  the minor fixes I made to check if a module was already loaded prior
  to allocating the final module memory with vmalloc and the respective
  debug code it introduces to help clarify the issue. Although the
  functional change is small it is rather safe as it can only *help*
  reduce vmalloc space for duplicates and is confirmed to fix a bootup
  issue with over 400 CPUs with KASAN enabled. I don't expect stable
  kernels to pick up that fix as the cleanups would have also had to
  have been picked up. Folks on larger CPU systems with modules will
  want to just upgrade if vmalloc space has been an issue on bootup.

  Given the size of this request, here's some more elaborate details:

  The functional change change in this pull request is the very first
  patch from Song Liu which replaces the 'struct module_layout' with a
  new 'struct module_memory'. The old data structure tried to put
  together all types of supported module memory types in one data
  structure, the new one abstracts the differences in memory types in a
  module to allow each one to provide their own set of details. This
  paves the way in the future so we can deal with them in a cleaner way.
  If you look at changes they also provide a nice cleanup of how we
  handle these different memory areas in a module. This change has been
  in linux-next since before the merge window opened for v6.3 so to
  provide more than a full kernel cycle of testing. It's a good thing as
  quite a bit of fixes have been found for it.

  Jason Baron then made dynamic debug a first class citizen module user
  by using module notifier callbacks to allocate / remove module
  specific dynamic debug information.

  Nick Alcock has done quite a bit of work cross-tree to remove module
  license tags from things which cannot possibly be module at my request
  so to:

   a) help him with his longer term tooling goals which require a
      deterministic evaluation if a piece a symbol code could ever be
      part of a module or not. But quite recently it is has been made
      clear that tooling is not the only one that would benefit.
      Disambiguating symbols also helps efforts such as live patching,
      kprobes and BPF, but for other reasons and R&D on this area is
      active with no clear solution in sight.

   b) help us inch closer to the now generally accepted long term goal
      of automating all the MODULE_LICENSE() tags from SPDX license tags

  In so far as a) is concerned, although module license tags are a no-op
  for non-modules, tools which would want create a mapping of possible
  modules can only rely on the module license tag after the commit
  8b41fc4454e ("kbuild: create modules.builtin without
  Makefile.modbuiltin or tristate.conf").

  Nick has been working on this *for years* and AFAICT I was the only
  one to suggest two alternatives to this approach for tooling. The
  complexity in one of my suggested approaches lies in that we'd need a
  possible-obj-m and a could-be-module which would check if the object
  being built is part of any kconfig build which could ever lead to it
  being part of a module, and if so define a new define
  -DPOSSIBLE_MODULE [0].

  A more obvious yet theoretical approach I've suggested would be to
  have a tristate in kconfig imply the same new -DPOSSIBLE_MODULE as
  well but that means getting kconfig symbol names mapping to modules
  always, and I don't think that's the case today. I am not aware of
  Nick or anyone exploring either of these options. Quite recently Josh
  Poimboeuf has pointed out that live patching, kprobes and BPF would
  benefit from resolving some part of the disambiguation as well but for
  other reasons. The function granularity KASLR (fgkaslr) patches were
  mentioned but Joe Lawrence has clarified this effort has been dropped
  with no clear solution in sight [1].

  In the meantime removing module license tags from code which could
  never be modules is welcomed for both objectives mentioned above. Some
  developers have also welcomed these changes as it has helped clarify
  when a module was never possible and they forgot to clean this up, and
  so you'll see quite a bit of Nick's patches in other pull requests for
  this merge window. I just picked up the stragglers after rc3. LWN has
  good coverage on the motivation behind this work [2] and the typical
  cross-tree issues he ran into along the way. The only concrete blocker
  issue he ran into was that we should not remove the MODULE_LICENSE()
  tags from files which have no SPDX tags yet, even if they can never be
  modules. Nick ended up giving up on his efforts due to having to do
  this vetting and backlash he ran into from folks who really did *not
  understand* the core of the issue nor were providing any alternative /
  guidance. I've gone through his changes and dropped the patches which
  dropped the module license tags where an SPDX license tag was missing,
  it only consisted of 11 drivers. To see if a pull request deals with a
  file which lacks SPDX tags you can just use:

    ./scripts/spdxcheck.py -f \
	$(git diff --name-only commid-id | xargs echo)

  You'll see a core module file in this pull request for the above, but
  that's not related to his changes. WE just need to add the SPDX
  license tag for the kernel/module/kmod.c file in the future but it
  demonstrates the effectiveness of the script.

  Most of Nick's changes were spread out through different trees, and I
  just picked up the slack after rc3 for the last kernel was out. Those
  changes have been in linux-next for over two weeks.

  The cleanups, debug code I added and final fix I added for modules
  were motivated by David Hildenbrand's report of boot failing on a
  systems with over 400 CPUs when KASAN was enabled due to running out
  of virtual memory space. Although the functional change only consists
  of 3 lines in the patch "module: avoid allocation if module is already
  present and ready", proving that this was the best we can do on the
  modules side took quite a bit of effort and new debug code.

  The initial cleanups I did on the modules side of things has been in
  linux-next since around rc3 of the last kernel, the actual final fix
  for and debug code however have only been in linux-next for about a
  week or so but I think it is worth getting that code in for this merge
  window as it does help fix / prove / evaluate the issues reported with
  larger number of CPUs. Userspace is not yet fixed as it is taking a
  bit of time for folks to understand the crux of the issue and find a
  proper resolution. Worst come to worst, I have a kludge-of-concept [3]
  of how to make kernel_read*() calls for modules unique / converge
  them, but I'm currently inclined to just see if userspace can fix this
  instead"

Link: https://lore.kernel.org/all/Y/kXDqW+7d71C4wz@bombadil.infradead.org/ [0]
Link: https://lkml.kernel.org/r/025f2151-ce7c-5630-9b90-98742c97ac65@redhat.com [1]
Link: https://lwn.net/Articles/927569/ [2]
Link: https://lkml.kernel.org/r/20230414052840.1994456-3-mcgrof@kernel.org [3]

* tag 'modules-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (121 commits)
  module: add debugging auto-load duplicate module support
  module: stats: fix invalid_mod_bytes typo
  module: remove use of uninitialized variable len
  module: fix building stats for 32-bit targets
  module: stats: include uapi/linux/module.h
  module: avoid allocation if module is already present and ready
  module: add debug stats to help identify memory pressure
  module: extract patient module check into helper
  modules/kmod: replace implementation with a semaphore
  Change DEFINE_SEMAPHORE() to take a number argument
  module: fix kmemleak annotations for non init ELF sections
  module: Ignore L0 and rename is_arm_mapping_symbol()
  module: Move is_arm_mapping_symbol() to module_symbol.h
  module: Sync code of is_arm_mapping_symbol()
  scripts/gdb: use mem instead of core_layout to get the module address
  interconnect: remove module-related code
  interconnect: remove MODULE_LICENSE in non-modules
  zswap: remove MODULE_LICENSE in non-modules
  zpool: remove MODULE_LICENSE in non-modules
  x86/mm/dump_pagetables: remove MODULE_LICENSE in non-modules
  ...
2023-04-27 16:36:55 -07:00