61 Commits

Author SHA1 Message Date
Jason Gunthorpe
6a217437f9 Merge branch 'sg_nents' into rdma.git for-next
From Maor Gottlieb
====================

Fix the use of nents and orig_nents in the sg table append helpers. The
nents should be used by the DMA layer to store the number of DMA mapped
sges, the orig_nents is the number of CPU sges.

Since the sg append logic doesn't always create a SGL with exactly
orig_nents entries store a total_nents as well to allow the table to be
properly free'd and reorganize the freeing logic to share across all the
use cases.

====================

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

* 'sg_nents':
  RDMA: Use the sg_table directly and remove the opencoded version from umem
  lib/scatterlist: Fix wrong update of orig_nents
  lib/scatterlist: Provide a dedicated function to support table append
2021-08-30 09:49:59 -03:00
Bob Pearson
e2a05339fa RDMA/rxe: Use the correct size of wqe when processing SRQ
The memcpy() that copies a WQE from a SRQ the QP uses an incorrect size.
The size should have been the size of the rxe_send_wqe struct not the size
of a pointer to it. The result is that IO operations using a SRQ on the
responder side will fail.

Fixes: ec0fa2445c18 ("RDMA/rxe: Fix over copying in get_srq_wqe")
Link: https://lore.kernel.org/r/20210729220039.18549-2-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-08-02 12:45:22 -03:00
Bob Pearson
1117f26ea7 RDMA/rxe: Move ICRC generation to a subroutine
Isolate ICRC generation into a single subroutine named rxe_generate_icrc()
in rxe_icrc.c. Remove scattered crc generation code from elsewhere.

Link: https://lore.kernel.org/r/20210707040040.15434-5-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-07-16 12:43:34 -03:00
Dan Carpenter
36941dfe0e RDMA/rxe: Missing unlock on error in get_srq_wqe()
This error path needs to unlock before returning.

Fixes: ec0fa2445c18 ("RDMA/rxe: Fix over copying in get_srq_wqe")
Link: https://lore.kernel.org/r/YNXUCmnPsSkPyhkm@mwanda
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Majd Dibbiny <majd@nvidia.com>
Reviewed-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-25 12:00:28 -03:00
Bob Pearson
3896bde92d RDMA/rxe: Fix extra copy in prepare_ack_packet
Currently prepare_ack_packet writes almost all the fields of the BTH in
the ack packet twice. Replace code with the subroutine init_bth().

Fixes: 8700e3e7c485 ("Soft RoCE driver")
Link: https://lore.kernel.org/r/20210618045742.204195-6-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-22 15:38:53 -03:00
Bob Pearson
ec0fa2445c RDMA/rxe: Fix over copying in get_srq_wqe
Currently get_srq_wqe() in rxe_resp.c copies the maximum possible number
of bytes from the wqe into the QPs copy of the SRQ wqe. This is usually
extra work and risks reading past the end of the SRQ circular buffer if
the SRQ is configured with less than the maximum possible number of SGEs.

Check the number of SGEs is not too large.
Compute the actual number of bytes in the WR and copy only those.

Fixes: 8700e3e7c485 ("Soft RoCE driver")
Link: https://lore.kernel.org/r/20210618045742.204195-5-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-22 15:38:52 -03:00
Bob Pearson
1993cbed65 RDMA/rxe: Fix extra copies in build_rdma_network_hdr
build_rdma_network_hdr() in rxe_resp.c does more copying than is
needed. Remove this subroutine and eliminate the extra copies for IPV6 and
reduce the extra copying for IPV4.

Fixes: e404f945a610 ("IB/rxe: improved debug prints & code cleanup")
Link: https://lore.kernel.org/r/20210618045742.204195-4-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-22 15:38:52 -03:00
Bob Pearson
fceb24a73e RDMA/rxe: Fix useless copy in send_atomic_ack
In send_atomic_ack() in rxe_resp.c there is code copying ack_pkt into the
skb->cb[]. This doesn't do anything useful because the cb[] is not used in
the transmit path by the rxe driver.

Remove this code.

Fixes: 4c93496f18ce ("IB/rxe: do not copy extra stack memory to skb")
Link: https://lore.kernel.org/r/20210618045742.204195-2-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearson@hpe.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-22 15:38:52 -03:00
Bob Pearson
cdd0b85675 RDMA/rxe: Implement memory access through MWs
Add code to implement memory access through memory windows.

Link: https://lore.kernel.org/r/20210608042552.33275-10-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-16 20:51:18 -03:00
Bob Pearson
3902b429ca RDMA/rxe: Implement invalidate MW operations
Implement invalidate MW and cleaned up invalidate MR operations.

Added code to perform remote invalidate for send with invalidate.  Added
code to perform local invalidation. Deleted some blank lines in rxe_loc.h.

Link: https://lore.kernel.org/r/20210608042552.33275-9-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-16 20:51:18 -03:00
Bob Pearson
15ae1375ea RDMA/rxe: Fix qp reference counting for atomic ops
Currently the rdma_rxe driver attempts to protect atomic responder
resources by taking a reference to the qp which is only freed when the
resource is recycled for a new read or atomic operation. This means that
in normal circumstances there is almost always an extra qp reference once
an atomic operation has been executed which prevents cleaning up the qp
and associated pd and cqs when the qp is destroyed.

This patch removes the call to rxe_add_ref() in send_atomic_ack() and the
call to rxe_drop_ref() in free_rd_atomic_resource(). If the qp is
destroyed while a peer is retrying an atomic op it will cause the
operation to fail which is acceptable.

Link: https://lore.kernel.org/r/20210604230558.4812-1-rpearsonhpe@gmail.com
Reported-by: Zhu Yanjun <zyjzyj2000@gmail.com>
Fixes: 86af61764151 ("IB/rxe: remove unnecessary skb_clone")
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-16 20:20:23 -03:00
Bob Pearson
5bcf5a59c4 RDMA/rxe: Protext kernel index from user space
In order to prevent user space from modifying the index that belongs to
the kernel for shared queues let the kernel use a local copy of the index
and copy any new values of that index to the shared rxe_queue_bus struct.

This adds more switch statements which decreases the performance of the
queue API. Move the type into the parameter list for these functions so
that the compiler can optimize out the switch statements when the explicit
type is known. Modify all the calls in the driver on performance paths to
pass in the explicit queue type.

Link: https://lore.kernel.org/r/20210527194748.662636-4-rpearsonhpe@gmail.com
Link: https://lore.kernel.org/linux-rdma/20210526165239.GP1002214@@nvidia.com/
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-03 15:53:01 -03:00
Bob Pearson
ea49225189 RDMA/rxe: Fix missing acks from responder
All responder errors from request packets that do not consume a receive
WQE fail to generate acks for RC QPs.  This patch corrects this behavior
by making the flow follow the same path as request packets that do consume
a WQE after the completion.

Link: https://lore.kernel.org/r/20210402001016.3210-1-rpearson@hpe.com
Link: https://lore.kernel.org/linux-rdma/1a7286ac-bcea-40fb-2267-480134dd301b@gmail.com/
Signed-off-by: Bob Pearson <rpearson@hpe.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-04-08 15:59:28 -03:00
Bob Pearson
364e282c4f RDMA/rxe: Split MEM into MR and MW
In the original rxe implementation it was intended to use a common object
to represent MRs and MWs but they are different enough to separate these
into two objects.

This allows replacing the mem name with mr for MRs which is more
consistent with the style for the other objects and less likely to be
confusing. This is a long patch that mostly changes mem to mr where it
makes sense and adds a new rxe_mw struct.

Link: https://lore.kernel.org/r/20210325212425.2792-1-rpearson@hpe.com
Signed-off-by: Bob Pearson <rpearson@hpe.com>
Acked-by: Zhu Yanjun <zyjzyj2000@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-03-30 17:11:30 -03:00
Jason Gunthorpe
7289e26f39 Linux 5.11
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmAppPgeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGeXYH/imZPBd4A1jIMehN
 5HV2A53Z+MXmmaMuGj9X1KV6vsf55/xB+IhOoFdtRAIsO8c2yYSCO8i4+4R0XfYA
 +/YFJeq672rojQnmh6XbpR8dugaAV7CUHy6n7KDsyvtT6EOCpwFSwkOb4X3tBRX6
 TlYgm2d/xgV/wRHSgLVugK0MdFCLMAnyb7mkPfar9QrMgG1BiDKLq07xmwnS23On
 TkqpJ9yZ/rJpUrrUqQYPShSO/FmA+fSfWs0CDv7EIrJ40LUScD6PZxSHWTIHtjLk
 E4jFda6wuqLRVWsBwaBzUIdD0zk7X5quHRzEpbC5ga16SK6yrWvE5YJJXCguIEuZ
 f3FMRYs=
 =CAjn
 -----END PGP SIGNATURE-----

Merge tag 'v5.11' into rdma.git for-next

Linux 5.11

Merged to resolve conflicts with RDMA rc commits

- drivers/infiniband/sw/rxe/rxe_net.c
  The final logic is to call rxe_get_dev_from_net() again with the master
  netdev if the packet was rx'd on a vlan. To keep the elimination of the
  local variables requires a trivial edit to the code in -rc

Link: https://lore.kernel.org/r/20210210131542.215ea67c@canb.auug.org.au
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-02-18 11:19:29 -04:00
Bob Pearson
bf139b58af RDMA/rxe: Remove unused pkt->offset
The pkt->offset field is never used except to assign it to 0.  But it adds
lots of unneeded code. This patch removes the field and related code. This
causes a measurable improvement in performance.

Link: https://lore.kernel.org/r/20210211210455.3274-1-rpearson@hpe.com
Signed-off-by: Bob Pearson <rpearson@hpe.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-02-16 14:42:59 -04:00
Bob Pearson
899aba891c RDMA/rxe: Fix FIXME in rxe_udp_encap_recv()
rxe_udp_encap_recv() drops the reference to rxe->ib_dev taken by
rxe_get_dev_from_net() which should be held until each received skb is
freed. This patch moves the calls to ib_device_put() to each place a
received skb is freed. It also takes references to the ib_device for each
cloned skb created to process received multicast packets.

Fixes: 4c173f596b3f ("RDMA/rxe: Use ib_device_get_by_netdev() instead of open coding")
Link: https://lore.kernel.org/r/20210128233318.2591-1-rpearson@hpe.com
Signed-off-by: Bob Pearson <rpearson@hpe.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-02-08 15:33:51 -04:00
Martin Wilck
f1b0a8ea9f Revert "RDMA/rxe: Remove VLAN code leftovers from RXE"
This reverts commit b2d2440430c0fdd5e0cad3efd6d1c9e3d3d02e5b.

It's true that creating rxe on top of 802.1q interfaces doesn't work.
Thus, commit fd49ddaf7e26 ("RDMA/rxe: prevent rxe creation on top of vlan
interface") was absolutely correct.

But b2d2440430c0 was incorrect assuming that with this change, RDMA and
VLAN don't work togehter at all. It just has to be set up
differently. Rather than creating rxe on top of the VLAN interface, rxe
must be created on top of the physical interface.  RDMA then works just
fine through VLAN interfaces on top of that physical interface, via the
"upper device" logic.

This is hard to see in the rxe logic because it never talks about vlan,
but instead rxe carefully selects upper vlan netdevices when working with
packets which in turn imply certain vlan tagging. This is all done
correctly and interacts with the gid table with VLAN support the same as
real HW does.

b2d2440430c0 broke this setup deliberately and should thus be
reverted. Also, b2d2440430c0 removed rxe_dma_device(), so adapt the revert
to discard that hunk.

Fixes: b2d2440430c0 ("RDMA/rxe: Remove VLAN code leftovers from RXE")
Link: https://lore.kernel.org/r/20210120161913.7347-1-mwilck@suse.com
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-01-20 13:29:28 -04:00
Zhu Yanjun
b2d2440430 RDMA/rxe: Remove VLAN code leftovers from RXE
Since the commit fd49ddaf7e26 ("RDMA/rxe: prevent rxe creation on top of
vlan interface") does not permit rxe on top of vlan device, all the stuff
related with vlan should be removed.

Fixes: fd49ddaf7e26 ("RDMA/rxe: prevent rxe creation on top of vlan interface")
Link: https://lore.kernel.org/r/1604326422-18625-1-git-send-email-yanjunz@nvidia.com
Signed-off-by: Zhu Yanjun <yanjunz@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-12 11:38:27 -04:00
Bob Pearson
63fa15dbd4 RDMA/rxe: Add SPDX hdrs to rxe source files
Add SPDX headers to all rxe .c and .h files.

Link: https://lore.kernel.org/r/20200827145439.2273-1-rpearson@hpe.com
Signed-off-by: Bob Pearson <rpearson@hpe.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-08-31 12:20:02 -03:00
Steve Wise
2030abddec rxe: correctly calculate iCRC for unaligned payloads
If RoCE PDUs being sent or received contain pad bytes, then the iCRC
is miscalculated, resulting in PDUs being emitted by RXE with an incorrect
iCRC, as well as ingress PDUs being dropped due to erroneously detecting
a bad iCRC in the PDU.  The fix is to include the pad bytes, if any,
in iCRC computations.

Note: This bug has caused broken on-the-wire compatibility with actual
hardware RoCE devices since the soft-RoCE driver was first put into the
mainstream kernel.  Fixing it will create an incompatibility with the
original soft-RoCE devices, but is necessary to be compatible with real
hardware devices.

Fixes: 8700e3e7c485 ("Soft RoCE driver")
Signed-off-by: Steve Wise <larrystevenwise@gmail.com>
Link: https://lore.kernel.org/r/20191203020319.15036-2-larrystevenwise@gmail.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-12-09 13:55:26 -05:00
Konstantin Taranov
bdce129049 RDMA/rxe: Fill in wc byte_len with IB_WC_RECV_RDMA_WITH_IMM
Calculate the correct byte_len on the receiving side when a work
completion is generated with IB_WC_RECV_RDMA_WITH_IMM opcode.

According to the IBA byte_len must indicate the number of written bytes,
whereas it was always equal to zero for the IB_WC_RECV_RDMA_WITH_IMM
opcode, even though data was transferred.

Fixes: 8700e3e7c485 ("Soft RoCE driver")
Signed-off-by: Konstantin Taranov <konstantin.taranov@inf.ethz.ch>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-07-08 16:40:15 -03:00
Zhu Yanjun
9802c335e7 IB/rxe: Remove unnecessary rxe variable
The variable rxe in the function is not used. So it is removed.

Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-01-21 16:46:08 -07:00
Zhu Yanjun
a854b1e890 IB/rxe: move the variable into the function that uses it
The variable rxe is only used in the function rxe_xmit_packet, and the
caller functions do not use it. So move this variable into the function
rxe_xmit_packet.

Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-11-08 14:22:54 -07:00
Andrew Boyer
6e5559b275 RDMA/rxe: Add link_down, rdma_sends, rdma_recvs stats counters
link_down is self-explanatory.

rdma_sends and rdma_recvs count the number of RDMA Send and RDMA Receive
operations completed successfully. This is different from the existing
sent_pkts and rcvd_pkts counters because the existing counters measure
packets, not RDMA operations.

ack_deffered is renamed to ack_deferred to fix the spelling.

out_of_sequence is renamed to out_of_seq_request to make clear that it is
counting only requests and not other packets which can be out of sequence.

Signed-off-by: Andrew Boyer <andrew.boyer@dell.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-11-08 14:22:54 -07:00
Sagi Grimberg
e48d8ed9c6 rxe: fix error completion wr_id and qp_num
Error completions must still contain a valid wr_id and
qp_num such that the consumer can rely on. Correctly
fill these fields in receive error completions.

Reported-by: Walker Benjamin <benjamin.walker@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Tested-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-11-06 16:25:04 -05:00
Zhu Yanjun
4e588c8d03 IB/rxe: clean skb queue directly
When resp is in error state, the queued SKBs will not be handled.
The function get_req cleans up the skb queue directly.

CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-11-06 16:03:14 -05:00
Vijay Immanuel
b97db58557 IB/rxe: fix for duplicate request processing and ack psns
Don't reset the resp opcode for a replayed read response.
The resp opcode could be in the middle of a write or send
sequence, when the duplicate read request was received.
An example sequence is as follows:
- Receive read request for 12KB PSN 20. Transmit read response
  first, middle and last with PSNs 20,21,22.
- Receive write first PSN 23.
  At this point the resp psn is 24 and resp opcode is write first.
- The sender notices that PSN 20 is dropped and retransmits.
  Receive read request for 12KB PSN 20. Transmit read response
  first, middle and last with PSNs 20,21,22. The resp opcode is
  set to -1, the resp psn remains 24.
- Receive write first PSN 23. This is processed by duplicate_request().
  The resp opcode remains -1 and resp psn remains 24.
- Receive write middle PSN 24. check_op_seq() reports a missing
  first error since the resp opcode is -1.

When sending an ack for a duplicate send or write request,
use the psn of the previous ack sent. Do not use the psn
of a read response for the ack.
An example sequence is as follows:
- Receive write PSN 30. Transmit ACK for PSN 30.
- Receive read request 4KB PSN 31. Transmit read response with
  PSN 31. The resp psn is now 32.
- The sender notices that PSN 30 is dropped and retransmits.
  Receive write PSN 30. duplicate_request() sends an ACK with
  PSN 31. That is incorrect since PSN 31 was a read request.

Signed-off-by: Vijay Immanuel <vijayi@attalasystems.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-08-30 17:22:12 -04:00
Parav Pandit
3db2bceb29 IB/rxe: Simplify rxe_find_route() to avoid GID query for netdev
rxe_prepare() is called on an skb which has ndev already initialized by
rxe_init_packet().
Therefore avoid querying the GID attribute again and use the available
netdevice from the skb->dev.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Tested-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-08-30 16:31:50 -04:00
Vijay Immanuel
92cf36eec2 IB/rxe: support for 802.1q VLAN on the listener
Set the vlan flag and vlan_id field in the wc for rdma_listen()
to work over VLAN. This is required by ib_init_ah_attr_from_wc()
which is called by the CM REQ handler.

Signed-off-by: Vijay Immanuel <vijayi@attalasystems.com>
Reviewed-by: Yonatan Cohen <yonatanc@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-18 13:16:04 -06:00
Doug Ledford
f5e27a203f Merge branch 'k.o/for-rc' into k.o/wip/dl-for-next
Several items of conflict have arisen between the RDMA stack's for-rc
branch and upcoming for-next work:

9fd4350ba895 ("IB/rxe: avoid double kfree_skb") directly conflicts with
2e47350789eb ("IB/rxe: optimize the function duplicate_request")

Patches already submitted by Intel for the hfi1 driver will fail to
apply cleanly without this merge

Other people on the mailing list have notified that their upcoming
patches also fail to apply cleanly without this merge

Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-05-09 15:48:48 -04:00
Zhu Yanjun
9fd4350ba8 IB/rxe: avoid double kfree_skb
When skb is sent, it will pass the following functions in soft roce.

rxe_send [rdma_rxe]
    ip_local_out
        __ip_local_out
        ip_output
            ip_finish_output
                ip_finish_output2
                    dev_queue_xmit
                        __dev_queue_xmit
                            dev_hard_start_xmit

In the above functions, if error occurs in the above functions or
iptables rules drop skb after ip_local_out, kfree_skb will be called.
So it is not necessary to call kfree_skb in soft roce module again.
Or else crash will occur.

The steps to reproduce:

     server                       client
    ---------                    ---------
    |1.1.1.1|<----rxe-channel--->|1.1.1.2|
    ---------                    ---------

On server: rping -s -a 1.1.1.1 -v -C 10000 -S 512
On client: rping -c -a 1.1.1.1 -v -C 10000 -S 512

The kernel configs CONFIG_DEBUG_KMEMLEAK and
CONFIG_DEBUG_OBJECTS are enabled on both server and client.

When rping runs, run the following command in server:

iptables -I OUTPUT -p udp  --dport 4791 -j DROP

Without this patch, crash will occur.

CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-04-27 14:20:47 -04:00
Zhu Yanjun
e12ee8ce51 IB/rxe: remove unused function variable
In the functions rxe_mem_init_dma, rxe_mem_init_user, rxe_mem_init_fast
and copy_data, the function variable rxe is not used. So this function
variable rxe is removed.

CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-04-27 12:18:29 -04:00
Zhu Yanjun
fe896ceb57 IB/rxe: replace refcount_inc with skb_get
Follow the advice from Bart, the function refcount_inc is replaced
with skb_get in commit 99dae690255e ("IB/rxe: optimize mcast recv process")
and commit 86af61764151 ("IB/rxe: remove unnecessary skb_clone").

CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Suggested-by: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-04-19 13:58:16 -04:00
Zhu Yanjun
2e47350789 IB/rxe: optimize the function duplicate_request
In the function duplicate_request, the reference of skb can be increased
to replace the function skb_clone.

This will make rxe performace better and save memory.

CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-04-19 13:58:16 -04:00
Zhu Yanjun
86af617641 IB/rxe: remove unnecessary skb_clone
In send_atomic_ack function, it is not necessary to make a
skb_clone. To gain better performance (high throughput and
low latency), this skb_clone is removed.

The following tests are made.

 server                       client
---------                    ---------
|1.1.1.1|<----rxe-channel--->|1.1.1.2|
---------                    ---------

On server: rping -s -a 1.1.1.1 -v -C 1000 -S 512
On client: rping -c -a 1.1.1.1 -v -C 1000 -S 512

The kernel config CONFIG_DEBUG_KMEMLEAK is enabled on both server
and client.

This test runs for several hours. There is no memory leak and the whole
system can work well.

Based on the above network, the following tests are made.

Server: ibv_rc_pingpong -d rxe0 -g 1
Client: ibv_rc_pingpong -d rxe0 -g 1 1.1.1.1

The test results on Server(10 tests are made).
Before:
Throughput is 137.07 Mbit/sec
Latency is 517.76 usec/iter

After:
Throughput is 148.85 Mbit/sec
Latency is 476.64 usec/iter

The throughput is enhanced and the latency is reduced.

CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>

Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-03-07 15:56:14 -07:00
Bart Van Assche
65567e4121 RDMA/rxe: Fix a race condition in rxe_requester()
The rxe driver works as follows:
* The send queue, receive queue and completion queues are implemented as
  circular buffers.
* ib_post_send() and ib_post_recv() calls are serialized through a spinlock.
* Removing elements from various queues happens from tasklet
  context. Tasklets are guaranteed to run on at most one CPU. This serializes
  access to these queues. See also rxe_completer(), rxe_requester() and
  rxe_responder().
* rxe_completer() processes the skbs queued onto qp->resp_pkts.
* rxe_requester() handles the send queue (qp->sq.queue).
* rxe_responder() processes the skbs queued onto qp->req_pkts.

Since rxe_drain_req_pkts() processes qp->req_pkts, calling
rxe_drain_req_pkts() from rxe_requester() is racy. Hence this patch.

Reported-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: stable@vger.kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-01-18 14:49:19 -05:00
Jason Gunthorpe
c966ea12c0 RDMA: Mark imm_data as be32 in the verbs uapi header
This matches what the userspace copy of this header has been doing
for a while. imm_data is an opaque 4 byte array carried over the network,
and invalidate_rkey is in CPU byte order.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-01-15 15:33:21 -07:00
Andrew Boyer
d45d29567f IB/rxe: Fix up the responder's find_resources() function
The resource array is sized by max_dest_rd_atomic, not max_rd_atomic.
Iterating over max_rd_atomic entries of qp->resp.resources[] will cause
incorrect behavior when the two attributes are different (or even
crash if max_rd_atomic is larger).

Fixes: 8700e3e7c485 ("Soft RoCE driver")
Signed-off-by: Andrew Boyer <andrew.boyer@dell.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-08-28 19:12:33 -04:00
Vijay Immanuel
1217197142 rxe: fix broken receive queue draining
If we modified the qp to ERROR state, and
drained the recieve queue, post_recv must
trigger the responder task to complete
the drain work request.

Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Vijay Immanuel <vijayi@attalasystems.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>--
Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-07-20 11:20:50 -04:00
Kees Cook
4c93496f18 IB/rxe: do not copy extra stack memory to skb
This fixes a over-read condition detected by FORTIFY_SOURCE for this
line:

	memcpy(SKB_TO_PKT(skb), &ack_pkt, sizeof(skb->cb));

The error was:

  In file included from ./include/linux/bitmap.h:8:0,
                   from ./include/linux/cpumask.h:11,
                   from ./include/linux/mm_types_task.h:13,
                   from ./include/linux/mm_types.h:4,
                   from ./include/linux/kmemcheck.h:4,
                   from ./include/linux/skbuff.h:18,
                   from drivers/infiniband/sw/rxe/rxe_resp.c:34:
  In function 'memcpy',
      inlined from 'send_atomic_ack.constprop' at drivers/infiniband/sw/rxe/rxe_resp.c:998:2,
      inlined from 'acknowledge' at drivers/infiniband/sw/rxe/rxe_resp.c:1026:3,
      inlined from 'rxe_responder' at drivers/infiniband/sw/rxe/rxe_resp.c:1286:10:
  ./include/linux/string.h:309:4: error: call to '__read_overflow2' declared with attribute error: detected read beyond size of object passed as 2nd parameter
      __read_overflow2();

Daniel Micay noted that struct rxe_pkt_info is 32 bytes on 32-bit
architectures, but skb->cb is still 64.  The memcpy() over-reads 32
bytes.  This fixes it by zeroing the unused bytes in skb->cb.

Link: http://lkml.kernel.org/r/1497903987-21002-5-git-send-email-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Moni Shoua <monis@mellanox.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Daniel Micay <danielmicay@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:03 -07:00
Johannes Thumshirn
d52418502e IB/rxe: Don't clamp residual length to mtu
When reading a RDMA WRITE FIRST packet we copy the DMA length from the RDMA
header into the qp->resp.resid variable for later use. Later in check_rkey()
we clamp it to the MTU if the packet is an  RDMA WRITE packet and has a
residual length bigger than the MTU. Later in write_data_in() we subtract the
payload of the packet from the residual length. If the packet happens to have a
payload of exactly the MTU size we end up with a residual length of 0 despite
the packet not being the last in the conversation. When the next packet in the
conversation arrives, we don't have any residual length left and thus set the QP
into an error state.

This broke NVMe over Fabrics functionality over rdma_rxe.ko

The patch was verified using the following test.

 # echo eth0 > /sys/module/rdma_rxe/parameters/add
 # nvme connect -t rdma -a 192.168.155.101 -s 1023 -n nvmf-test
 # mkfs.xfs -fK /dev/nvme0n1
 meta-data=/dev/nvme0n1           isize=256    agcount=4, agsize=65536 blks
          =                       sectsz=4096  attr=2, projid32bit=1
          =                       crc=0        finobt=0, sparse=0
 data     =                       bsize=4096   blocks=262144, imaxpct=25
          =                       sunit=0      swidth=0 blks
 naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
 log      =internal log           bsize=4096   blocks=2560, version=2
          =                       sectsz=4096  sunit=1 blks, lazy-count=1
 realtime =none                   extsz=4096   blocks=0, rtextents=0
 # mount /dev/nvme0n1 /tmp/
 [  148.923263] XFS (nvme0n1): Mounting V4 Filesystem
 [  148.961196] XFS (nvme0n1): Ending clean mount
 # dd if=/dev/urandom of=test.bin bs=1M count=128
 128+0 records in
 128+0 records out
 134217728 bytes (134 MB, 128 MiB) copied, 0.437991 s, 306 MB/s
 # sha256sum test.bin
 cde42941f045efa8c4f0f157ab6f29741753cdd8d1cff93a6b03649d83c4129a  test.bin
 # cp test.bin /tmp/
 sha256sum /tmp/test.bin
 cde42941f045efa8c4f0f157ab6f29741753cdd8d1cff93a6b03649d83c4129a  /tmp/test.bin

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Max Gurtovoy <maxg@mellanox.com>
Acked-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-05-01 14:42:58 -04:00
Yonatan Cohen
0b1e5b99a4 IB/rxe: Add port protocol stats
Expose new counters using the get_hw_stats callback.
We expose the following counters:

+---------------------+----------------------------------------+
|      Name           |           Description                  |
|---------------------+----------------------------------------|
|sent_pkts            | number of sent pkts                    |
|---------------------+----------------------------------------|
|rcvd_pkts            | number of received packets             |
|---------------------+----------------------------------------|
|out_of_sequence      | number of errors due to packet         |
|                     | transport sequence number              |
|---------------------+----------------------------------------|
|duplicate_request    | number of received duplicated packets. |
|                     | A request that previously executed is  |
|                     | named duplicated.                      |
|---------------------+----------------------------------------|
|rcvd_rnr_err         | number of received RNR by completer    |
|---------------------+----------------------------------------|
|send_rnr_err         | number of sent RNR by responder        |
|---------------------+----------------------------------------|
|rcvd_seq_err         | number of out of sequence packets      |
|                     | received                               |
|---------------------+----------------------------------------|
|ack_deffered         | number of deferred handling of ack     |
|                     | packets.                               |
|---------------------+----------------------------------------|
|retry_exceeded_err   | number of times retry exceeded         |
|---------------------+----------------------------------------|
|completer_retry_err  | number of times completer decided to   |
|                     | retry                                  |
|---------------------+----------------------------------------|
|send_err             | number of failed send packet           |
+---------------------+----------------------------------------+

Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com>
Reviewed-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Andrew Boyer <andrew.boyer@dell.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-04-21 10:43:28 -04:00
David Marchand
9fcd67d177 IB/rxe: increment msn only when completing a request
According to C9-147, MSN should only be incremented when the last packet of
a multi packet request has been received.

"Logically, the requester associates a sequential Send Sequence Number
(SSN) with each WQE posted to the send queue. The SSN bears a one-
to-one relationship to the MSN returned by the responder in each re-
sponse packet. Therefore, when the requester receives a response, it in-
terprets the MSN as representing the SSN of the most recent request
completed by the responder to determine which send WQE(s) can be
completed."

Fixes: 8700e3e7c485 ("Soft RoCE driver")

Signed-off-by: David Marchand <david.marchand@6wind.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-03-24 22:07:27 -04:00
Doug Ledford
6dd7abae71 Merge branch 'k.o/for-4.10-rc' into HEAD 2017-02-19 09:18:21 -05:00
Eyal Itkin
628f07d33c IB/rxe: Fix resid update
Update the response's resid field when larger than MTU, instead of only
updating the local resid variable.

Fixes: 8700e3e7c485 ("Soft RoCE driver")
Signed-off-by: Eyal Itkin <eyal.itkin@gmail.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-02-08 12:28:30 -05:00
Bart Van Assche
839f5ac0d8 IB/rxe: Remove a pointless indirection layer
Neither rxe->ifc_ops nor any of the function pointers in struct
struct rxe_ifc_ops ever change. Hence remove the rxe->ifc_ops
indirection mechanism.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Andrew Boyer <andrew.boyer@dell.com>
Cc: Moni Shoua <monis@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-01-10 16:52:47 -05:00
Bart Van Assche
ab17654476 IB/rxe: Fix reference leaks in memory key invalidation code
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Andrew Boyer <andrew.boyer@dell.com>
Cc: Moni Shoua <monis@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-01-10 16:52:47 -05:00
Bart Van Assche
b3a4599610 IB/rxe: Fix a MR reference leak in check_rkey()
Avoid that calling check_rkey() for mem->state == RXE_MEM_STATE_FREE
triggers an MR reference leak.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Andrew Boyer <andrew.boyer@dell.com>
Cc: Moni Shoua <monis@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-01-10 16:52:47 -05:00
Bart Van Assche
18d3451c0d IB/rxe: Generate a completion for all failed work requests
Change do_complete() such that an error completion is not only
generated if a QP is in the error state but also if a work request
failed.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Andrew Boyer <andrew.boyer@dell.com>
Cc: Moni Shoua <monis@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-01-10 16:52:47 -05:00